[PATCH v7 0/7] mm: Hot page tracking and promotion infrastructure

Bharata B Rao posted 7 patches 1 month, 1 week ago
Documentation/admin-guide/mm/index.rst |   1 +
Documentation/admin-guide/mm/pghot.rst |  80 ++++
arch/x86/Kconfig                       |  16 +
arch/x86/include/asm/ibs-caps.h        |  93 ++++
arch/x86/include/asm/ibs-mprof.h       |  46 ++
arch/x86/include/asm/msr-index.h       |   8 +
arch/x86/include/asm/perf_event.h      |  81 +---
arch/x86/mm/Makefile                   |   1 +
arch/x86/mm/ibs-mprof.c                | 308 ++++++++++++
include/linux/cpuhotplug.h             |   1 +
include/linux/migrate.h                |   9 +-
include/linux/mm.h                     |  35 +-
include/linux/mmzone.h                 |  24 +-
include/linux/pghot.h                  | 113 +++++
include/linux/vm_event_item.h          |  11 +
init/Kconfig                           |  13 +
kernel/sched/core.c                    |   7 +
kernel/sched/debug.c                   |   1 -
kernel/sched/fair.c                    | 177 +------
kernel/sched/sched.h                   |   1 -
mm/Kconfig                             |  34 ++
mm/Makefile                            |   6 +
mm/huge_memory.c                       |  24 +-
mm/memcontrol.c                        |   6 +-
mm/memory-tiers.c                      |  15 +-
mm/memory.c                            |  28 +-
mm/mempolicy.c                         |   3 -
mm/migrate.c                           |  98 +++-
mm/mm_init.c                           |  10 +
mm/pghot-default.c                     |  79 +++
mm/pghot-precise.c                     |  81 ++++
mm/pghot-tunables.c                    | 182 +++++++
mm/pghot.c                             | 633 +++++++++++++++++++++++++
mm/vmstat.c                            |  13 +-
34 files changed, 1922 insertions(+), 316 deletions(-)
create mode 100644 Documentation/admin-guide/mm/pghot.rst
create mode 100644 arch/x86/include/asm/ibs-caps.h
create mode 100644 arch/x86/include/asm/ibs-mprof.h
create mode 100644 arch/x86/mm/ibs-mprof.c
create mode 100644 include/linux/pghot.h
create mode 100644 mm/pghot-default.c
create mode 100644 mm/pghot-precise.c
create mode 100644 mm/pghot-tunables.c
create mode 100644 mm/pghot.c
[PATCH v7 0/7] mm: Hot page tracking and promotion infrastructure
Posted by Bharata B Rao 1 month, 1 week ago
Hi,

This is v7 of pghot, a hot-page tracking and promotion subsystem. The
main change in this version is to add support for IBS Memory Profiler
as a page hotness source (PGHOT_HWHINTS). IBS Memory Profiler is a
facility that will be present in future AMD processors. It provides memory
access information and is independent of the existing IBS instance that
is primarily used by the perf subsystem.

This patchset introduces a new subsystem for hot page tracking and
promotion (pghot) with the following goals:

- Unify hot page detection from multiple sources such as hint faults,
  page table scans, and hardware hints (AMD IBS).
- Decouple detection from migration.
- Centralize promotion logic via per-lower-tier-node kmigrated kernel
  threads.
- Move promotion rate-limiting and related logic used by
  numa_balancing=2 (NUMAB2, the current NUMA balancing-based promotion)
  from the scheduler to pghot for broader reuse.
  
Currently, multiple kernel subsystems detect page accesses independently.
This patchset consolidates accesses from these mechanisms by providing:

- A common API for reporting page accesses.
- Shared infrastructure for tracking hotness at PFN granularity.
- Per-lower-tier-node kernel threads for promoting pages.

Here is a brief summary of how this subsystem works:

- Tracks frequency and last access time.
- In precision mode, the accessing NUMA node ID (NID) is also tracked
  for each recorded access.
- These hotness parameters are maintained in a per-PFN hotness record
  within the existing mem_section data structure.
  - In default mode, one byte (u8) is used for the hotness record. 5 bits
    are used to store time, and a bucketing scheme is used to represent a
    total access time of up to 4s with HZ=1000. The default toptier NID (0)
    is used as the target for promotion, which can be changed via a
    debugfs tunable.
  - In precision mode, 4 bytes (u32) are used for each hotness record.
    14 bits are used to store time, which can represent around 16s
    with HZ=1000.
- Classifies pages as hot based on configurable thresholds.
- Pages classified as hot are marked as ready for migration using the
  ready bit. Both modes use the MSB of the hotness record as the ready bit.
- Per-lower-tier-node kmigrated threads periodically scan the PFNs of
  lower-tier nodes, checking for the migration-ready bit to perform
  batched migrations. The interval between successive scans and the
  batch size are configurable via debugfs tunables.
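
As a rough illustration of the flow described above, here is a minimal
user-space sketch in C. It is not the patchset's code. It assumes the
default-mode layout (frequency in bits 0-1, ready bit in bit 7), ignores the
time field entirely, and uses C11 atomics to stand in for lock-free record
updates. The names record_access(), scan_ready() and freq_threshold are
hypothetical.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define FREQ_MASK 0x03u /* bits 0-1: saturating access counter */
#define READY_BIT 0x80u /* bit 7: migration-ready flag */

/*
 * Hypothetical helper: record one access to a page.
 * Returns true only when this access flips the record to "ready".
 */
static inline bool record_access(_Atomic uint8_t *rec, unsigned freq_threshold)
{
	uint8_t old = atomic_load(rec);
	uint8_t new;

	assert(freq_threshold >= 1 && freq_threshold <= FREQ_MASK);
	do {
		unsigned freq = old & FREQ_MASK;

		if (freq < FREQ_MASK)
			freq++; /* saturate at the field maximum */
		new = (uint8_t)((old & ~FREQ_MASK) | freq);
		if (freq >= freq_threshold)
			new |= READY_BIT;
		/* On failure, 'old' is reloaded and the update is retried. */
	} while (!atomic_compare_exchange_weak(rec, &old, new));

	return (new & READY_BIT) && !(old & READY_BIT);
}

/*
 * Hypothetical scanner step: collect up to 'batch' indices whose record
 * has the ready bit set, resetting each collected record to zero.
 */
static inline size_t scan_ready(_Atomic uint8_t *recs, size_t nr,
				size_t *out, size_t batch)
{
	size_t n = 0;

	for (size_t i = 0; i < nr && n < batch; i++) {
		if (atomic_load(&recs[i]) & READY_BIT) {
			atomic_store(&recs[i], 0);
			out[n++] = i;
		}
	}
	return n;
}
```

With freq_threshold set to 2, the second call to record_access() on a fresh
record marks it ready, and a later scan_ready() pass would pick it up as part
of a batch.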

Memory overhead
---------------
Default mode: 1 byte per lower-tier PFN. For 1TB of lower-tier memory
this amounts to 256MB of overhead (assuming 4K pages).

Precision mode: 4 bytes per lower-tier PFN. For 1TB of lower-tier memory
this amounts to 1GB of overhead (assuming 4K pages).
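
The arithmetic behind these figures can be checked with a small sketch like
the one below. The helper name pghot_overhead_bytes() is hypothetical and
only restates the calculation: number of lower-tier pages times the record
size.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper: bytes of hotness metadata needed for a given amount
 * of lower-tier memory, one record per page frame.
 */
static inline uint64_t pghot_overhead_bytes(uint64_t lower_tier_bytes,
					    uint64_t page_size,
					    uint64_t record_bytes)
{
	assert(page_size != 0);
	return (lower_tier_bytes / page_size) * record_bytes;
}
```

For 1TB (2^40 bytes) and 4K pages there are 2^28 page frames, so a 1-byte
record costs 2^28 bytes (256MB) and a 4-byte record costs 2^30 bytes (1GB).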

Bit layout of hotness record
----------------------------
Default mode
- Bits 0-1: Frequency (2 bits, 4 access samples)
- Bits 2-6: Bucketed time (5 bits, up to 4s with HZ=1000)
- Bit 7: Migration ready bit

Precision mode
- Bits 0-9: Target NID (10 bits)
- Bits 10-12: Frequency (3 bits, 8 access samples)
- Bits 13-26: Time (14 bits, up to 16s with HZ=1000)
- Bits 27-30: Reserved
- Bit 31: Migration ready bit
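
The two layouts above can be expressed with shift/mask helpers along the
following lines. This is only a sketch of the bit positions listed here. The
macro and function names are made up for illustration and are not taken from
include/linux/pghot.h.

```c
#include <assert.h>
#include <stdint.h>

/* Default mode (u8): bits 0-1 freq, bits 2-6 bucketed time, bit 7 ready */
#define DEF_FREQ_MASK  0x3u
#define DEF_TIME_SHIFT 2
#define DEF_TIME_MASK  0x1fu
#define DEF_READY_SHIFT 7

/* Precision mode (u32): bits 0-9 NID, 10-12 freq, 13-26 time, 31 ready */
#define PRE_NID_MASK   0x3ffu
#define PRE_FREQ_SHIFT 10
#define PRE_FREQ_MASK  0x7u
#define PRE_TIME_SHIFT 13
#define PRE_TIME_MASK  0x3fffu
#define PRE_READY_SHIFT 31

static inline uint8_t def_pack(unsigned freq, unsigned time, unsigned ready)
{
	assert(freq <= DEF_FREQ_MASK && time <= DEF_TIME_MASK && ready <= 1);
	return (uint8_t)(freq | (time << DEF_TIME_SHIFT) |
			 (ready << DEF_READY_SHIFT));
}

static inline unsigned def_freq(uint8_t rec)
{
	return rec & DEF_FREQ_MASK;
}

static inline unsigned def_time(uint8_t rec)
{
	return (rec >> DEF_TIME_SHIFT) & DEF_TIME_MASK;
}

static inline unsigned def_ready(uint8_t rec)
{
	return (rec >> DEF_READY_SHIFT) & 1u;
}

static inline uint32_t pre_pack(unsigned nid, unsigned freq, unsigned time,
				unsigned ready)
{
	assert(nid <= PRE_NID_MASK && freq <= PRE_FREQ_MASK);
	assert(time <= PRE_TIME_MASK && ready <= 1);
	/* Reserved bits 27-30 are left as zero. */
	return (uint32_t)nid |
	       ((uint32_t)freq << PRE_FREQ_SHIFT) |
	       ((uint32_t)time << PRE_TIME_SHIFT) |
	       ((uint32_t)ready << PRE_READY_SHIFT);
}

static inline unsigned pre_nid(uint32_t rec)
{
	return rec & PRE_NID_MASK;
}

static inline unsigned pre_time(uint32_t rec)
{
	return (rec >> PRE_TIME_SHIFT) & PRE_TIME_MASK;
}
```

Packing the maximum value into every field gives 0xff for the default record
and 0x87ffffff for the precision record, the latter showing the four reserved
bits (27-30) left clear.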

Potential hotness sources
-------------------------
1. NUMA Balancing (NUMAB2, Tiering mode) - included in this patchset.
2. AMD IBS Memory Profiler: HW based access profiler - included in this
   patchset.
3. klruscand - PTE A bit scanning built on MGLRU's walk helpers - was
   showcased in previous versions but is not part of this version.
4. folio_mark_accessed() - Page cache access tracking (unmapped
   page cache pages) - was showcased in previous versions but is not
   part of this patchset.

Changes in v7
-------------
- Added AMD IBS Memory Profiler as page hotness source.
- Addressed review comments from v6 (Thanks to Shashiko AI, Gregory and Donet)
  - Early exit from batched migration routine if input
    list is empty
  - Changed the name of batched migration routine to indicate
    that it handles "promotion" of batched "memcg" folios.
  - Debug code in batched migration routine to check if all
    the folios in the input list belong to the same memcg.
  - Kconfig dependency cleanups.
  - Fix off-by-one regression in nid check in pghot-precise.
  - More checks to validate nid in pghot-precise.
  - Early check to not call kmigrated_run() for lower tier nodes.
  - Handling PTE writable and ignore_writable conditions correctly
    in hint fault handler.
  - Using unsigned int instead of unsigned long for representing
    time in ms.
  - Misc cleanups.

Results
=======
Posted as replies to this mail thread.

This v7 patchset applies on top of upstream commit c1f49dea2b8f and
can be fetched from:

https://github.com/AMDESE/linux-mm/tree/bharata/pghot-v7

v6: https://lore.kernel.org/linux-mm/20260323095104.238982-1-bharata@amd.com/
v5: https://lore.kernel.org/linux-mm/20260129144043.231636-1-bharata@amd.com/
v4: https://lore.kernel.org/linux-mm/20251206101423.5004-1-bharata@amd.com/
v3: https://lore.kernel.org/linux-mm/20251110052343.208768-1-bharata@amd.com/
v2: https://lore.kernel.org/linux-mm/20250910144653.212066-1-bharata@amd.com/
v1: https://lore.kernel.org/linux-mm/20250814134826.154003-1-bharata@amd.com/
v0: https://lore.kernel.org/linux-mm/20250306054532.221138-1-bharata@amd.com/

Bharata B Rao (6):
  mm: migrate: Allow misplaced migration without VMA
  mm: Hot page tracking and promotion - pghot
  mm: pghot: Precision mode for pghot
  mm: sched: move NUMA balancing tiering promotion to pghot
  x86/ibs: Move IBS caps definitions into its own header
  x86/mm/ibs: In-kernel driver for AMD IBS Memory Profiler

Gregory Price (1):
  mm: migrate: Add promote_misplaced_memcg_folios()

base-commit: c1f49dea2b8f335813d3b348fd39117fb8efb428

IBS Memory Profiler driver part of this patchset depends on the
patchset that increases the number of APIC EILVT registers -
https://lore.kernel.org/lkml/cover.1775019269.git.naveen@kernel.org/
-- 
2.34.1

Re: [PATCH v7 0/7] mm: Hot page tracking and promotion infrastructure
Posted by Bharata B Rao 1 month, 1 week ago
On 04-May-26 11:39 AM, Bharata B Rao wrote:
> Results
> =======
> Posted as replies to this mail thread.

Graph500 benchmark results:

Test system details
-------------------
3 node AMD Zen5 system with 2 regular NUMA nodes (0, 1) and a CXL node (2)

$ numactl -H
available: 3 nodes (0-2)
node 0 cpus: 0-95,192-287
node 0 size: 128460 MB
node 1 cpus: 96-191,288-383
node 1 size: 128893 MB
node 2 cpus:
node 2 size: 257993 MB
node distances:
node   0   1   2
  0:  10  32  50
  1:  32  10  60
  2: 255 255  10

Hotness sources
---------------
NUMAB0 - Without NUMA Balancing in base case and with no source enabled
         in the pghot case. No migrations occur.
NUMAB2 - Existing hot page promotion for the base case and
         use of hint faults as source in the pghot case.
NUMAB3 - Enabled both regular and tiering mode of NUMA Balancing
         (kernel.numa_balancing=3)

Pghot by default promotes after two accesses but for NUMAB2 source,
promotion is done after one access to match the base behaviour.
(/sys/kernel/debug/pghot/freq_threshold=1)

Graph500 details
----------------
Command: mpirun -n 128 --bind-to core --map-by core
graph500/src/graph500_reference_bfs 28 16

After the graph creation, the processes are stopped and data is migrated
to CXL node 2 before continuing so that BFS phase starts accessing lower
tier memory.

Total memory usage is slightly over 100GB and will fit within Node 0 and 1.
Hence there is no memory pressure to induce demotions.

harmonic_mean_TEPS - Higher is better
=====================================================================================
                        Base            Base            pghot-default   pghot-precise
                        NUMAB0          NUMAB2          NUMAB2          NUMAB2
=====================================================================================
harmonic_mean_TEPS      5.08026e+08     7.48633e+08     5.46257e+08     7.45101e+08
mean_time               8.45413         5.73702         7.86245         5.76421
median_TEPS             5.09236e+08     7.25058e+08     5.40525e+08     7.63752e+08
max_TEPS                5.15244e+08     1.03391e+09     8.51317e+08     9.7552e+08

pgpromote_success       0               13809474        13763582        13763155
numa_pte_updates        0               26746117        39502157        36368086
numa_hint_faults        0               13811769        24248272        21172314
=====================================================================================
                                                        pghot-default
                                                        NUMAB3
=====================================================================================
harmonic_mean_TEPS                                      7.00515e+08
mean_time                                               6.13109
median_TEPS                                             7.06813e+08
max_TEPS                                                7.63164e+08

pgpromote_success                                       13762087
numa_pte_updates                                        93632490
numa_hint_faults                                        70566306
=====================================================================================
- The base case shows a good improvement with NUMAB2 in harmonic_mean_TEPS.
- The same improvement gets maintained with pghot-precise too.
- pghot-default mode doesn't show benefit even when achieving similar page promotion
  numbers. This mode doesn't track accessing NID and by default promotes to NID=0
  which probably isn't all that beneficial as processes are running on both Node 0
  and Node 1.
- pghot-default recovers the performance when balancing between toptier nodes
  0 and 1 is enabled in addition to hot page promotion.
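
To put the table in relative terms, a small helper like the following
(hypothetical, written only for this comparison) computes the percentage
change of a result against a baseline.

```c
#include <assert.h>

/* Hypothetical helper: percentage change of 'value' relative to 'base'. */
static inline double pct_change(double base, double value)
{
	assert(base > 0.0);
	return (value - base) / base * 100.0;
}
```

Applied to the harmonic_mean_TEPS row with Base NUMAB0 (5.08026e+08) as the
baseline, this gives roughly +47% for Base NUMAB2, +8% for pghot-default
NUMAB2, +47% for pghot-precise NUMAB2 and +38% for pghot-default NUMAB3.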
Re: [PATCH v7 0/7] mm: Hot page tracking and promotion infrastructure
Posted by Andrew Morton 1 month ago
On Tue, 5 May 2026 16:11:43 +0530 Bharata B Rao <bharata@amd.com> wrote:

> On 04-May-26 11:39 AM, Bharata B Rao wrote:
> > Results
> > =======
> > Posted as replies to this mail thread.
> 
> Graph500 benchmark results:

Please include (and maintain) the testing results in the formal
changelogs (perhaps in the [0/N], in a condensed summary form).

I mean, the entire point of the whole patchset is to improve
performance (yes?), so this contribution lives or dies by its
performance testing results.

The first thing your audience will want to know is "how good is this
for our users".  So tell us!  Up front, within the first paragraphs!

The better the results, the more motivated people will be to help get
your work upstream.
Re: [PATCH v7 0/7] mm: Hot page tracking and promotion infrastructure
Posted by Bharata B Rao 1 month ago
On 09-May-26 6:48 AM, Andrew Morton wrote:
> On Tue, 5 May 2026 16:11:43 +0530 Bharata B Rao <bharata@amd.com> wrote:
> 
>> On 04-May-26 11:39 AM, Bharata B Rao wrote:
>>> Results
>>> =======
>>> Posted as replies to this mail thread.
>>
>> Graph500 benchmark results:
> 
> Please include (and maintain) the testing results in the formal
> changelogs (perhaps in the [0/N], in a condensed summary form).

The results and associated description were getting too long and hence I was
hesitating to make it part of 0/N. But then as you say, I shall include a
condensed summary from next time.

> 
> I mean, the entire point of the whole patchset is to improve
> performance (yes?), so this contribution lives or dies by its
> performance testing results.

The entire point of this patchset is not just about improving the performance.
It is mainly about adding a new dedicated infrastructure for detecting and
promoting hot pages. It is about having a subsystem that can act as a single
source of truth for page hotness in the kernel. Though we aren't there yet, we have
started by having a minimal infrastructure that centralizes the hot page
promotion and associated heuristics that currently sit in the scheduler so that the
same can be used with other page hotness sources as well.

The first source is hint-fault-based hot page promotion. Here the address
space scanning and introduction of hint faults remain as before, but the
promotion engine is part of pghot. Hence the comparison with the base kernel
for this source is about matching the current level of performance and
ensuring that workloads don't suffer due to batched migration.

There are other sources as well, the primary one being the IBS Memory Profiler,
which provides memory access information directly from the hardware. I have some
numbers for this source as well. Initial results look encouraging, and more tests
can tell us whether this source can be an independent one or complements the
existing one.

Then the earlier versions of this patchset had another source - PTE A bit based
scanning, where the idea was to completely replace the hint fault based mechanism
with PTE A bit based access detection, thereby taking both the detection and
promotion parts out of the process context. I have temporarily removed this from
this patchset for two reasons: a) to simplify the patchset so that we can get
some consensus on the infrastructure part first, and b) to explore the
commonality with another PTE A bit scanning approach (called klruscand) that
uses MGLRU's scanning mechanism.

Also on the horizon is the use of hot page info that the CXL Hotness Monitoring
Unit (CHMU) can provide.

> 
> The first thing your audience will want to know is "how good is this
> for our users".  So tell us!  Up front, within the first paragraphs!
> 
> The better the results, the more motivated people will be to help get
> your work upstream.

So currently it is a multi-step approach, with the first step being to build a
common hotness infrastructure and move the existing mechanism over to it w/o any
regression, and then to follow up with more sources.

Regards,
Bharata.
Re: [PATCH v7 0/7] mm: Hot page tracking and promotion infrastructure
Posted by Gregory Price 1 month ago
On Mon, May 11, 2026 at 04:07:16PM +0530, Bharata B Rao wrote:
> 
> The entire point of this patchset is not just about improving the performance.
> It is mainly about adding a new dedicated infrastructure for detecting and
> promoting hot pages. It is about having a subsystem that can act as a single
> source of truth for page hotness in the kernel. Though we aren't there yet, we have
> started by having a minimal infrastructure that centralizes the hot page
> promotion and associated heuristics that currently sit in the scheduler so that the
> same can be used with other page hotness sources as well.
> 

The goal of hotness tracking in general is to improve performance.

The goal of PGHot should be a reasonable baseline for the kernel to
course-correct LRU inversions across tiers over time, because LRU
threads only scan individual nodes and don't compare across nodes.

I would hazard against trying to wholesale state it "Shall be the single
source of truth", as we will inevitably discover some condition which is
not covered / cannot be captured / we will simply get it wrong.

Plus, intuitively, counter-balancing LRU/MGLRU aging is probably as good
as we can get without having to inject per-workload information
into the system - at which point the users should use DAMON.

~Gregory
Re: [PATCH v7 0/7] mm: Hot page tracking and promotion infrastructure
Posted by Bharata B Rao 1 month ago
On 11-May-26 8:08 PM, Gregory Price wrote:
> On Mon, May 11, 2026 at 04:07:16PM +0530, Bharata B Rao wrote:
>>
>> The entire point of this patchset is not just about improving the performance.
>> It is mainly about adding a new dedicated infrastructure for detecting and
>> promoting hot pages. It is about having a subsystem that can act as a single
>> source of truth for page hotness in the kernel. Though we aren't there yet, we have
>> started by having a minimal infrastructure that centralizes the hot page
>> promotion and associated heuristics that currently sit in the scheduler so that the
>> same can be used with other page hotness sources as well.
>>
> 
> The goal of hotness tracking in general is to improve performance.

Agreed. As I have mentioned elsewhere in the thread, right now we have just
moved the existing promotion mechanism to pghot, hence the initial concern has
been to ensure the earlier performance levels are still met with centralized
promotion engine that does batched promotions from non-process context.

> 
> The goal of PGHot should be a reasonable baseline for the kernel to
> course-correct LRU inversions across tiers over time, because LRU
> threads only scan individual nodes and don't compare across nodes.

Right.

> 
> I would hazard against trying to wholesale state it "Shall be the single
> source of truth", as we will inevitably discover some condition which is
> not covered / cannot be captured / we will simply get it wrong.

Yeah. The ideal goal of a single source of truth may be a bit far-fetched, but
pghot is definitely a subsystem that can work with multiple page hotness
sources, aggregate hot signals from them and provide a single unified promotion
mechanism.

Regards,
Bharata.
Re: [PATCH v7 0/7] mm: Hot page tracking and promotion infrastructure
Posted by Matthew Wilcox 1 month, 1 week ago
On Mon, May 04, 2026 at 11:39:17AM +0530, Bharata B Rao wrote:
> This is v7 of pghot, a hot-page tracking and promotion subsystem. The

I continue to think we should not do this.
Re: [PATCH v7 0/7] mm: Hot page tracking and promotion infrastructure
Posted by Gregory Price 1 month ago
On Mon, May 04, 2026 at 09:36:05PM +0100, Matthew Wilcox wrote:
> On Mon, May 04, 2026 at 11:39:17AM +0530, Bharata B Rao wrote:
> > This is v7 of pghot, a hot-page tracking and promotion subsystem. The
> 
> I continue to think we should not do this.

My only pushback on the general "we should not do this" is that we need
something to counter-balance the demotion bit in vmscan.c, and the
current implementation (prot_none faults) is rather :[

I think this series needs to greatly limit its complexity and provide
some gentle correction for LRU inversions, and I think they're making a
decent attempt at that.

But then I think local memory expansion on CXL is going pretty
swimmingly in our datacenters :], others may not feel the same.

~Gregory
Re: [PATCH v7 0/7] mm: Hot page tracking and promotion infrastructure
Posted by Bharata B Rao 1 month ago

On 06-May-26 8:52 PM, Gregory Price wrote:
> On Mon, May 04, 2026 at 09:36:05PM +0100, Matthew Wilcox wrote:
>> On Mon, May 04, 2026 at 11:39:17AM +0530, Bharata B Rao wrote:
>>> This is v7 of pghot, a hot-page tracking and promotion subsystem. The
>>
>> I continue to think we should not do this.
> 
> My only pushback on the general "we should not do this" is that we need
> something to counter-balance the demotion bit in vmscan.c, and the
> current implementation (prot_none faults) is rather :[

So you are saying pghot subsystem currently does hot page detection and
promotion only, which is fine. But the current implementation of demotion is not
very optimal and hence we should spend effort in fine-tuning demotion first?

In this series itself I have shown via benchmark numbers that for over-committed
cases (involving both demotion and promotion), the workload isn't really showing
real benefit due to demotion and promotion. Are you specifically referring to
this problem?

> 
> I think this series needs to greatly limit its complexity and provide
> some gentle correction for LRU inversions, and I think they're making a
> decent attempt at that.

Regarding complexity, I agree that the initial version of this patchset was
quite complicated in the way it maintained hot page information. But the later
versions including this one have greatly reduced the complexity with one byte of
hot page information per PFN, atomic updates to hotness data without any locks,
per-lowertier kmigrated threads for promotion and reuse of existing hot page
promotion engine.

Did you have anything else in mind wrt complexity?

Can you provide more context about the LRU inversion problem?

Regards,
Bharata.
Re: [PATCH v7 0/7] mm: Hot page tracking and promotion infrastructure
Posted by Gregory Price 1 month ago
On Mon, May 11, 2026 at 03:32:20PM +0530, Bharata B Rao wrote:
> 
> 
> On 06-May-26 8:52 PM, Gregory Price wrote:
> > On Mon, May 04, 2026 at 09:36:05PM +0100, Matthew Wilcox wrote:
> >> On Mon, May 04, 2026 at 11:39:17AM +0530, Bharata B Rao wrote:
> >>> This is v7 of pghot, a hot-page tracking and promotion subsystem. The
> >>
> >> I continue to think we should not do this.
> > 
> > My only pushback on the general "we should not do this" is that we need
> > something to counter-balance the demotion bit in vmscan.c, and the
> > current implementation (prot_none faults) is rather :[
> 
> So you are saying pghot subsystem currently does hot page detection and
> promotion only, which is fine. But the current implementation of demotion is not
> very optimal and hence we should spend effort in fine-tuning demotion first?
>

I'm saying that because of demotion and fallbacks, we need a mechanism to
handle promotions.  I'm not convinced hotness tracking will extend to
coldness - at least not any better than LRU/MGLRU.

> In this series itself I have shown via benchmark numbers that for over-committed
> cases (involving both demotion and promotion), the workload isn't really showing
> real benefit due to demotion and promotion. Are you specifically referring to
> this problem?
> 

If over-committed means over-subscribed hot-tier (more hot memory than
available top tier memory), then yeah that result is intuitive.  I
haven't pointed to any specific issue, as of yet, still taking time to
consider some of the results.

> 
> Can you provide more context about the LRU inversion problem?
> 

I've been tracking some data around shrink_folio_list and
alloc_migrate_folio behavior when a low tier node is full.

The result is we end up just swapping memory from high tier straight to
swap and skip demotion, resulting in a bunch of file and anon refaults.

Hardware: Single Socket, 768GB DRAM, 256GB CXL Expander

In this workload, we see swap usage after the full 1TB of memory is
utilized, and as a result we see swap spillage.

second_chance = second alloc attempt in alloc_migrate_folio succeeds
swap_fallback = second chance fails, we swap directly from top tier

Sample data:

pgdemote_kswapd           333052779
pgdemote_direct          3181480482
pgdemote_second_chance     31017629
pgdemote_swap_fallback    335759535
workingset_refault_anon    30106868
workingset_refault_file  2343035341

(note here: swap fallback is number of occurrences, while the others are
 number of pages.  As a result, the actual number of swapped pages is
 likely much closer to the pgdemote_direct number)

As a result:  LRU is just broken on CXL systems, LRU inverts by design.

In a sane world we would just see the second tier as an extension of the
LRU, but that doesn't necessarily mean we can glean hotness data from it
(it's still largely a coldness tracking mechanism).

I have patches I haven't RFC'd yet that try to address this, but I need
more time to test it.

I don't think this is something to address with PGHot.

---

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 112983b42559..ccdd698c5937 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1043,7 +1043,10 @@ struct folio *alloc_migrate_folio(struct folio *src, unsigned long private)
        mtc->gfp_mask &= ~__GFP_THISNODE;
        mtc->nmask = allowed_mask;

-       return alloc_migration_target(src, (unsigned long)mtc);
+       dst = alloc_migration_target(src, (unsigned long)mtc);
+       if (dst)
+               count_vm_events(PGDEMOTE_SECOND_CHANCE, folio_nr_pages(src));
+       return dst;
 }

 /*
@@ -1616,6 +1619,8 @@ static unsigned int shrink_folio_list(struct list_head *folio_list,
        /* Folios that could not be demoted are still in @demote_folios */
        if (!list_empty(&demote_folios)) {
                /* Folios which weren't demoted go back on @folio_list */
+               if (!sc->proactive)
+                       count_vm_event(PGDEMOTE_SWAP_FALLBACK);
                list_splice_init(&demote_folios, folio_list);

                /*
Re: [PATCH v7 0/7] mm: Hot page tracking and promotion infrastructure
Posted by Balbir Singh 1 month, 1 week ago
On 5/5/26 06:36, Matthew Wilcox wrote:
> On Mon, May 04, 2026 at 11:39:17AM +0530, Bharata B Rao wrote:
>> This is v7 of pghot, a hot-page tracking and promotion subsystem. The
> 
> I continue to think we should not do this.
> 

I am unclear about the benefits of the patchset, I have not tested
it or reviewed the latest revision. My big concern was that top-tier
might not always be suitable.

I see that there are some numbers posted, but I find this weird
"After the graph creation, the processes are stopped and data is migrated
to CXL node 2 before continuing so that BFS phase starts accessing lower
tier memory." Why not allocate everything on CXL node 2?

Balbir
Re: [PATCH v7 0/7] mm: Hot page tracking and promotion infrastructure
Posted by Bharata B Rao 1 month, 1 week ago
On 06-May-26 3:47 AM, Balbir Singh wrote:
> On 5/5/26 06:36, Matthew Wilcox wrote:
>> On Mon, May 04, 2026 at 11:39:17AM +0530, Bharata B Rao wrote:
>>> This is v7 of pghot, a hot-page tracking and promotion subsystem. The
>>
>> I continue to think we should not do this.
>>
> 
> I am unclear about the benefits of the patchset, I have not tested
> it or reviewed the latest revision. My big concern was that top-tier
> might not always be suitable.

So you are saying that we should have a capability to promote accessed pages
from lower tier to another tier that is not classified as top tier? Is that
non-top tier node the one which generates accesses?

> 
> I see that there are some numbers posted, but I find this weird
> "After the graph creation, the processes are stopped and data is migrated
> to CXL node 2 before continuing so that BFS phase starts accessing lower
> tier memory." Why not allocate everything on CXL node 2?

In the ideal scenario, the benefit is to see if any pages that land up on lower
tier get identified as hot and get promoted. That means we need to create an
over-committed scenario where the pages get demoted first. I have provided
numbers from such cases in my previous versions. The problem with this case is
that the base hot page promotion (NUMAB2) hasn't shown any benefit at all with
my micro-benchmark - Ref:
https://lore.kernel.org/linux-mm/868004d8-bb8e-4800-9fdd-ade48e95fe3b@amd.com/

Same has been observed with redis-memtier benchmark -
https://lore.kernel.org/linux-mm/957f2242-56d4-4bf0-8aeb-9d60fbea8c8c@amd.com/

Instead what I am doing here is to take out demotion from the scenario but still
retain the access pattern of the benchmark by pushing out the data to lower tier
when the benchmark reaches steady allocation state.

Regards,
Bharata.
Re: [PATCH v7 0/7] mm: Hot page tracking and promotion infrastructure
Posted by Balbir Singh 1 month, 1 week ago
On 5/6/26 13:43, Bharata B Rao wrote:
> On 06-May-26 3:47 AM, Balbir Singh wrote:
>> On 5/5/26 06:36, Matthew Wilcox wrote:
>>> On Mon, May 04, 2026 at 11:39:17AM +0530, Bharata B Rao wrote:
>>>> This is v7 of pghot, a hot-page tracking and promotion subsystem. The
>>>
>>> I continue to think we should not do this.
>>>
>>
>> I am unclear about the benefits of the patchset, I have not tested
>> it or reviewed the latest revision. My big concern was that top-tier
>> might not always be suitable.
> 
> So you are saying that we should have a capability to promote accessed pages
> from a lower tier to another tier that is not classified as top tier? Is that
> non-top tier node the one which generates accesses?
> 

Yes, a top tier node could be CPU-less, for example.

>>
>> I see that there are some numbers posted, but I find this weird
>> "After the graph creation, the processes are stopped and data is migrated
>> to CXL node 2 before continuing so that BFS phase starts accessing lower
>> tier memory." Why not allocate everything on CXL node 2?
> 
> In the ideal scenario, the benefit is to see if any pages that land up on lower
> tier get identified as hot and get promoted. That means we need to create an
> over-committed scenario where the pages get demoted first. I have provided

Why do the pages need to get demoted? Why not allocate them from the lower tier
to show that promotion upwards is helpful?

> numbers from such cases in my previous versions. The problem with this case is
> that the base hot page promotion (NUMAB2) hasn't shown any benefit at all with
> my micro-benchmark - Ref:
> https://lore.kernel.org/linux-mm/868004d8-bb8e-4800-9fdd-ade48e95fe3b@amd.com/
> 
> Same has been observed with redis-memtier benchmark -
> https://lore.kernel.org/linux-mm/957f2242-56d4-4bf0-8aeb-9d60fbea8c8c@amd.com/
> 
> Instead what I am doing here is to take out demotion from the scenario but still
> retain the access pattern of the benchmark by pushing out the data to lower tier
> when the benchmark reaches steady allocation state.
> 

Balbir
Re: [PATCH v7 0/7] mm: Hot page tracking and promotion infrastructure
Posted by Bharata B Rao 1 month, 1 week ago
On 06-May-26 9:32 AM, Balbir Singh wrote:
>>> I am unclear about the benefits of the patchset, I have not tested
>>> it or reviewed the latest revision. My big concern was that top-tier
>>> might not always be suitable.
>>
>> So you are saying that we should have a capability to promote accessed pages
>> from a lower tier to another tier that is not classified as top tier? Is that
>> non-top tier node the one which generates accesses?
>>
> 
> Yes, a top tier node could be CPU-less, for example.

Currently, the kmigrated thread in pghot doesn't explicitly prevent promotion to
non-toptier nodes. Here is how this works for the two modes of operation in pghot:

pghot-default: In this mode, the target NID isn't explicitly tracked and hence
kmigrated relies on the user-configurable pghot_target_nid. Though there is a
!node_is_toptier(nid) check in the helper routine that populates
pghot_target_nid, that can be relaxed if required.

pghot-precise: In this mode, the accessing CPU's node is tracked as the target
nid and promotion is done to that node. Note that pghot_target_nid isn't used here.
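
The target selection described for the two modes can be sketched as a small toy
model. This is not code from the patchset: the names Mode, pick_target_nid,
set_target_nid and allow_non_toptier are hypothetical, and only the behaviour
stated above (a configured pghot_target_nid in pghot-default, the accessing
CPU's node in pghot-precise, and a toptier check that could be relaxed) is
modelled.

```python
from enum import Enum


class Mode(Enum):
    DEFAULT = "pghot-default"
    PRECISE = "pghot-precise"


def set_target_nid(nid, toptier_nodes, allow_non_toptier=False):
    """Toy model of the helper that populates pghot_target_nid.

    Rejects non-toptier nodes unless the (hypothetical) relaxation
    flag is set.
    """
    if nid not in toptier_nodes and not allow_non_toptier:
        raise ValueError(f"node {nid} is not a toptier node")
    return nid


def pick_target_nid(mode, configured_target_nid, accessing_cpu_nid=None):
    """Return the node a hot page would be promoted to in each mode."""
    if mode is Mode.PRECISE:
        # pghot-precise: the accessing CPU's node is tracked per page,
        # and the configured target is not consulted.
        if accessing_cpu_nid is None:
            raise ValueError("pghot-precise needs the accessing CPU's node")
        return accessing_cpu_nid
    # pghot-default: no per-page target is tracked, so fall back to the
    # user-configured node.
    return configured_target_nid


print(pick_target_nid(Mode.DEFAULT, configured_target_nid=0, accessing_cpu_nid=1))
print(pick_target_nid(Mode.PRECISE, configured_target_nid=0, accessing_cpu_nid=1))
```

In this model, a CPU-less node can become the target in pghot-default only when
the toptier check is relaxed.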

Hence I don't see any major issues in this patchset to cover your use case. Let
me know if I miss anything here.

BTW, does the existing hot page promotion cover the use case you are targeting?

> 
>>>
>>> I see that there are some numbers posted, but I find this weird
>>> "After the graph creation, the processes are stopped and data is migrated
>>> to CXL node 2 before continuing so that BFS phase starts accessing lower
>>> tier memory." Why not allocate everything on CXL node 2?
>>
>> In the ideal scenario, the benefit is to see if any pages that land up on lower
>> tier get identified as hot and get promoted. That means we need to create an
>> over-committed scenario where the pages get demoted first. I have provided
> 
> Why do the pages need to get demoted? Why not allocate them from the lower tier
> to show that promotion upwards is helpful?

As you can see, these are controlled experiments to measure the effectiveness of
hot page detection and promotion and the benefits from promotion. It can be done
in the way you are suggesting; just that I found it a bit simpler to pause the
benchmark, migrate all pages to lower tier memory before the benchmark starts
accessing them rather than relying on setting memory policies to achieve the
same effect.
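
For illustration, the two setups being compared could look roughly like the
sketch below. It only builds command lines and runs nothing. It assumes the
migratepages and numactl tools from the numactl package; the helper names, the
PIDs and the benchmark command are hypothetical, and this is not necessarily
how the posted numbers were produced.

```python
def pause_migrate_resume_plan(pids, from_nodes, to_node):
    """Commands to stop the processes, move their pages, then resume them."""
    from_arg = ",".join(str(n) for n in from_nodes)
    plan = []
    for pid in pids:
        plan.append(["kill", "-STOP", str(pid)])
    for pid in pids:
        # migratepages <pid> <from-nodes> <to-nodes>
        plan.append(["migratepages", str(pid), from_arg, str(to_node)])
    for pid in pids:
        plan.append(["kill", "-CONT", str(pid)])
    return plan


def membind_command(node, argv):
    """Alternative: start the workload with its memory bound to one node."""
    return ["numactl", f"--membind={node}", *argv]


for cmd in pause_migrate_resume_plan([1001, 1002], from_nodes=[0, 1], to_node=2):
    print(" ".join(cmd))
print(" ".join(membind_command(2, ["./benchmark"])))
```

The first variant leaves the allocation phase untouched and moves the data only
once it is in place. The second binds every allocation made by the process to
the lower tier from the start.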

Regards,
Bharata.
Re: [PATCH v7 0/7] mm: Hot page tracking and promotion infrastructure
Posted by Bharata B Rao 1 month, 1 week ago
On 04-May-26 11:39 AM, Bharata B Rao wrote:
> 
> Results
> =======
> Posted as replies to this mail thread.

Micro-benchmark numbers for IBS Memory Profiler pghot source

Test system details
-------------------
2 node AMD system with 1 regular NUMA node (0) and a CXL node (1)

$ numactl -H
available: 2 nodes (0-1)
node 0 cpus: 0-255
node 0 size: 515563 MB
node 1 cpus:
node 1 size: 258034 MB
node distances:
node   0   1
  0:  10  50
  1: 255  10

Hotness sources
---------------
NUMAB0 - Without NUMA Balancing in base case and with no source enabled
         in the patched case. No migrations occur.
NUMAB2 - Existing hot page promotion for the base case
HWHINTS - IBS Memory Profiler as source for pghot

Pghot by default promotes after two accesses but for NUMAB2 source,
promotion is done after one access to match the base behaviour.
(/sys/kernel/debug/pghot/freq_threshold=1)
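
The effect of the threshold can be pictured with a toy counter model. It
assumes the default corresponds to freq_threshold=2 and ignores everything
else pghot may take into account (time windows, rate limits, the target node).
The class and method names are hypothetical.

```python
from collections import Counter


class HotnessTracker:
    """Toy model: a page becomes a promotion candidate after N accesses."""

    def __init__(self, freq_threshold=2):
        self.freq_threshold = freq_threshold
        self.counts = Counter()
        self.promoted = set()

    def record_access(self, pfn):
        """Return True if this access makes the page a promotion candidate."""
        if pfn in self.promoted:
            return False
        self.counts[pfn] += 1
        if self.counts[pfn] >= self.freq_threshold:
            self.promoted.add(pfn)
            del self.counts[pfn]
            return True
        return False


default = HotnessTracker()                    # two accesses needed
eager = HotnessTracker(freq_threshold=1)      # matches the base behaviour
print(default.record_access(42), default.record_access(42))
print(eager.record_access(42))
```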

==============================================================
Scenario 1 - Enough memory in toptier and hence only promotion
==============================================================
Multi-threaded application with 64 threads that access memory (8G) at
4K granularity repeatedly and randomly. The number of accesses per
thread and the randomness pattern for each thread are fixed beforehand.
The accesses are divided into stores and loads in the ratio of 50:50.
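
A scaled-down sketch of such a fixed access plan is shown below. It is not the
benchmark source: the seeding scheme, the alternating store/load split and the
function name are assumptions made for illustration.

```python
import random

PAGE_SIZE = 4096  # 4K access granularity


def make_access_plan(thread_id, region_bytes, n_accesses, seed=1234):
    """Build a reproducible list of (offset, op) pairs for one thread."""
    # A per-thread seed keeps each thread's pattern fixed across runs.
    rng = random.Random(seed * 1_000_003 + thread_id)
    n_pages = region_bytes // PAGE_SIZE
    plan = []
    for i in range(n_accesses):
        page = rng.randrange(n_pages)
        # Alternate stores and loads to get a 50:50 split.
        op = "store" if i % 2 == 0 else "load"
        plan.append((page * PAGE_SIZE, op))
    return plan


plan = make_access_plan(thread_id=0, region_bytes=1 << 20, n_accesses=8)
for offset, op in plan:
    print(f"{op:5s} @ {offset:#x}")
```

Fixing the plan up front means every run and every kernel configuration sees
the same sequence of page accesses, so run times are directly comparable.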

Benchmark threads run on Node 0, while memory is initially provisioned on
CXL node 1 before the accesses start.

Repeated accesses result in lowertier pages becoming hot and kmigrated
detecting and migrating them. The benchmark score is the time taken to
finish the accesses in microseconds. The sooner it finishes the better it is.
All the numbers shown below are the average of 3 runs.

Time taken (microseconds, lower is better)
---------------------------------------------------------
Source          Base            Pghot-default
---------------------------------------------------------
NUMAB0          181,393,365     184,331,381
NUMAB2          42,287,528
HWHINTS         NA              50,422,862
---------------------------------------------------------
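
To make the table easier to read, the ratios can be derived from the posted
numbers. The snippet only does arithmetic on the values above; the dictionary
layout and variable names are mine.

```python
# Time taken in microseconds, copied from the table above.
times_us = {
    ("NUMAB0", "base"): 181_393_365,
    ("NUMAB0", "pghot-default"): 184_331_381,
    ("NUMAB2", "base"): 42_287_528,
    ("HWHINTS", "pghot-default"): 50_422_862,
}

# Speedup of each promoting configuration over its own no-migration run.
numab2_speedup = times_us[("NUMAB0", "base")] / times_us[("NUMAB2", "base")]
hwhints_speedup = (times_us[("NUMAB0", "pghot-default")]
                   / times_us[("HWHINTS", "pghot-default")])

# Run time of pghot with HWHINTS relative to base NUMAB2.
hwhints_vs_numab2 = (times_us[("HWHINTS", "pghot-default")]
                     / times_us[("NUMAB2", "base")])

print(f"base NUMAB2 vs base NUMAB0:     {numab2_speedup:.2f}x faster")
print(f"pghot HWHINTS vs pghot NUMAB0:  {hwhints_speedup:.2f}x faster")
print(f"pghot HWHINTS vs base NUMAB2:   {hwhints_vs_numab2:.2f}x the run time")
```

This gives roughly 4.29x for base NUMAB2, 3.66x for pghot with HWHINTS, and a
pghot HWHINTS run time about 1.19x that of base NUMAB2.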

Stats comparison between base-NUMAB2 and pghot-default-hwhints
---------------------------------------------------------------------
                                Base-NUMAB2     Pghot-default-hwhints
---------------------------------------------------------------------
pgpromote_success               2097152         1961087
numa_hint_faults                2358069         0
pghot_recorded_accesses         NA              1962696
pghot_recorded_hintfaults       NA              0
pghot_recorded_hwhints          NA              5532979
hwhint_total_events             NA              5532979
---------------------------------------------------------------------
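
A quick cross-check of these counters against the working set size, assuming
"8G" means 8 GiB and that promotions are counted in 4 KiB base pages (both are
assumptions, as are the variable names):

```python
working_set_bytes = 8 * 1024**3   # assumed: 8G == 8 GiB
page_bytes = 4 * 1024             # assumed: 4 KiB base pages
total_pages = working_set_bytes // page_bytes

base_promoted = 2_097_152         # pgpromote_success, base NUMAB2
pghot_promoted = 1_961_087        # pgpromote_success, pghot-default hwhints
recorded_accesses = 1_962_696     # pghot_recorded_accesses
recorded_hwhints = 5_532_979      # pghot_recorded_hwhints

base_coverage = base_promoted / total_pages
pghot_coverage = pghot_promoted / total_pages
accesses_per_hint = recorded_accesses / recorded_hwhints

print(f"pages in working set:          {total_pages}")
print(f"base NUMAB2 promoted:          {base_coverage:.1%} of the working set")
print(f"pghot hwhints promoted:        {pghot_coverage:.1%} of the working set")
print(f"recorded accesses per hwhint:  {accesses_per_hint:.3f}")
```

Under these assumptions, base NUMAB2 promoted exactly the whole working set
(2097152 pages), while pghot with hwhints promoted about 93.5% of it.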
Re: [PATCH v7 0/7] mm: Hot page tracking and promotion infrastructure
Posted by Bharata B Rao 1 month, 1 week ago
On 04-May-26 11:39 AM, Bharata B Rao wrote:
> Results
> =======
> Posted as replies to this mail thread.

Initial Graph500 benchmark numbers for IBS Memory Profiler source:

Test system details
-------------------
3 node AMD system with 2 regular NUMA nodes (0, 1) in NPS2 mode and a CXL node (2)

$ numactl -H
available: 3 nodes (0-2)
node 0 cpus: 0-63,128-191
node 0 size: 257715 MB
node 1 cpus: 64-127,192-255
node 1 size: 257845 MB
node 2 cpus:
node 2 size: 258032 MB
node distances:
node   0   1   2
  0:  10  12  50
  1:  12  10  50
  2: 255 255  10

Hotness sources
---------------
NUMAB0 - Without NUMA Balancing in base case and with no source enabled
         in the pghot case. No migrations occur.
NUMAB2 - Existing hot page promotion for the base case and
         use of hint faults as source in the pghot case.
HWHINTS - IBS Memory Profiler as source for pghot

Pghot by default promotes after two accesses but for NUMAB2 and HWHINTS
sources, promotion is done after one access to match the base behaviour.
(/sys/kernel/debug/pghot/freq_threshold=1)

Graph500 details
----------------
Command: mpirun -n 128 --bind-to core --map-by core
graph500/src/graph500_reference_bfs 28 16

After the graph creation, the processes are stopped and data is migrated
to CXL node 2 before continuing so that BFS phase starts accessing lower
tier memory.

Total memory usage is slightly over 100GB and will fit within Node 0 and 1.
Hence there is no memory pressure to induce demotions.

harmonic_mean_TEPS - Higher is better
=============================================================================
                                Base            Base            pghot-default
                                NUMAB0          NUMAB2          NUMAB2
=============================================================================
harmonic_mean_TEPS              4.09614e+08     1.28401e+09     1.47926e+09
mean_time                       10.4853         3.34492         2.90342
median_TEPS                     4.10086e+08     1.44584e+09     1.85957e+09
max_TEPS                        4.1661e+08      1.79773e+09     1.99242e+09

pgpromote_success               0               13746029        13412213
numa_hint_faults                0               13753808        26669823

pghot_recorded_accesses         NA              NA              26669551
pghot_recorded_hintfaults       NA              NA              26669823
pghot_recorded_hwhints          NA              NA              0
hwhint_total_events             NA              NA              0
=============================================================================
                                                                pghot-default
                                                                HWHINTS
=============================================================================
harmonic_mean_TEPS                                              1.52334e+09
mean_time                                                       2.81941
median_TEPS                                                     1.57446e+09
max_TEPS                                                        1.72014e+09

pgpromote_success                                               3415599
numa_hint_faults                                                0

pghot_recorded_accesses                                         3440912
pghot_recorded_hintfaults                                       0
pghot_recorded_hwhints                                          24475210
hwhint_total_events                                             24475244
=============================================================================
The absence of migration (NUMAB0) hurts Graph500 badly. HWHINTS with pghot
delivers benchmark numbers similar to base NUMAB2 (harmonic_mean_TEPS of
1.52334e+09 vs 1.28401e+09) while migrating far less aggressively
(pgpromote_success of 3415599 vs 13746029).
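
The comparison can be quantified from the two tables. The snippet below only
does arithmetic on the posted values; the dictionary keys are labels chosen
for this sketch.

```python
# harmonic_mean_TEPS and pgpromote_success, copied from the tables above.
teps = {
    "base-NUMAB0": 4.09614e08,
    "base-NUMAB2": 1.28401e09,
    "pghot-NUMAB2": 1.47926e09,
    "pghot-HWHINTS": 1.52334e09,
}
promotions = {
    "base-NUMAB0": 0,
    "base-NUMAB2": 13_746_029,
    "pghot-NUMAB2": 13_412_213,
    "pghot-HWHINTS": 3_415_599,
}

baseline = teps["base-NUMAB0"]
for name, value in teps.items():
    print(f"{name:14s} {value / baseline:.2f}x of NUMAB0 TEPS, "
          f"{promotions[name]:>10d} promotions")

teps_ratio = teps["pghot-HWHINTS"] / teps["base-NUMAB2"]
promo_ratio = promotions["pghot-HWHINTS"] / promotions["base-NUMAB2"]
print(f"HWHINTS vs base NUMAB2: {teps_ratio:.2f}x TEPS "
      f"with {promo_ratio:.1%} of the promotions")
```

By this measure, pghot with HWHINTS reaches about 1.19x the harmonic mean TEPS
of base NUMAB2 while performing roughly a quarter of the promotions.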