Hi,
This is v7 of pghot, a hot-page tracking and promotion subsystem. The
main change in this version is the addition of IBS Memory Profiler
as a page hotness source (PGHOT_HWHINTS). IBS Memory Profiler is a
facility that will be present in future AMD processors. It provides memory
access information and is independent of the existing IBS instance that
is primarily used by the perf subsystem.
This patchset introduces a new subsystem for hot page tracking and
promotion (pghot) with the following goals:
- Unify hot page detection from multiple sources like hint faults,
page table scans, hardware hints (AMD IBS).
- Decouple detection from migration.
- Centralize promotion logic via per-lower-tier-node kmigrated kernel
thread.
- Move promotion rate‑limiting and related logic used by
numa_balancing=2 (NUMAB2, the current NUMA balancing–based promotion)
from the scheduler to pghot for broader reuse.
Currently, multiple kernel subsystems detect page accesses independently.
This patchset consolidates accesses from these mechanisms by providing:
- A common API for reporting page accesses.
- Shared infrastructure for tracking hotness at PFN granularity.
- Per-lower-tier-node kernel threads for promoting pages.
Here is a brief summary of how this subsystem works:
- Tracks frequency and last access time.
- Additionally, the accessing NUMA node ID (NID) for each recorded
access is also tracked in the precision mode.
- These hotness parameters are maintained in a per-PFN hotness record
within the existing mem_section data structure.
- In default mode, one byte (u8) is used for each hotness record. 5 bits
are used to store time, and a bucketing scheme is used to represent a
total access time of up to 4s with HZ=1000. The default toptier NID (0)
is used as the target for promotion; it can be changed via a debugfs tunable.
- In precision mode, 4 bytes (u32) are used for each hotness record.
14 bits are used to store time, which can represent around 16s
with HZ=1000.
- Classifies pages as hot based on configurable thresholds.
- Pages classified as hot are marked as ready for migration using the
ready bit. Both modes use MSB of the hotness record as ready bit.
- Per-lower-tier-node kmigrated threads periodically scan the PFNs of
lower-tier nodes, checking for the migration-ready bit to perform
batched migrations. Interval between successive scans and batching
value are configurable via debugfs tunables.
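The flow described above (count accesses, mark the record ready once a threshold is reached, then let a scanner collect ready PFNs in batches) can be sketched in user-space C. This is only an illustration under stated assumptions: the names record_access() and scan_ready() are hypothetical and are not taken from the patchset, the time field is ignored, and C11 atomics stand in for whatever lockless update the kernel code actually uses.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Default-mode layout from the cover letter: bits 0-1 frequency, bit 7 ready. */
#define FREQ_MASK 0x03u
#define READY     0x80u

/*
 * Hypothetical helper: record one access to the page behind *rec.
 * The frequency saturates at its 2-bit maximum. The ready bit is set once
 * the frequency reaches the threshold. Returns true if the record is ready.
 */
static bool record_access(_Atomic uint8_t *rec, unsigned int threshold)
{
	uint8_t old = atomic_load(rec);
	uint8_t next;

	assert(threshold >= 1 && threshold <= FREQ_MASK);
	do {
		unsigned int freq = old & FREQ_MASK;

		if (freq < FREQ_MASK)
			freq++;
		next = (uint8_t)((old & ~FREQ_MASK) | freq);
		if (freq >= threshold)
			next |= READY;
		/* On failure, old is reloaded and the update is recomputed. */
	} while (!atomic_compare_exchange_weak(rec, &old, next));

	return (next & READY) != 0;
}

/*
 * Hypothetical scanner: walk n records, claim up to batch_max ready ones by
 * resetting them to zero, and store their indices (PFN offsets) in batch[].
 * Returns the number of indices collected.
 */
static size_t scan_ready(_Atomic uint8_t *recs, size_t n,
			 size_t *batch, size_t batch_max)
{
	size_t found = 0;

	for (size_t pfn = 0; pfn < n && found < batch_max; pfn++) {
		uint8_t v = atomic_load(&recs[pfn]);

		if (!(v & READY))
			continue;
		/* Claim the record only if nobody changed it in between. */
		if (atomic_compare_exchange_strong(&recs[pfn], &v, 0))
			batch[found++] = pfn;
	}
	return found;
}
```

With a threshold of 2, the second call to record_access() on the same record marks it ready, and a following scan_ready() returns that index and clears the record.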
Memory overhead
---------------
Default mode: 1 byte per lower-tier PFN. For 1TB of lower-tier memory
this amounts to 256MB overhead (assuming 4K pages).
Precision mode: 4 bytes per lower-tier PFN. For 1TB of lower-tier memory
this amounts to 1GB overhead (assuming 4K pages).
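The overhead figures follow from (memory size / page size) * record size. A small C check of that arithmetic, where pghot_overhead_bytes() is a hypothetical helper name used only for this illustration:

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper: bytes of hotness metadata needed for mem_bytes of
 * lower-tier memory, with one record of record_size bytes per page.
 */
static uint64_t pghot_overhead_bytes(uint64_t mem_bytes, uint64_t page_size,
				     uint64_t record_size)
{
	assert(page_size != 0);
	return (mem_bytes / page_size) * record_size;
}
```

For 1TB and 4K pages there are 2^28 PFNs, so a 1-byte record gives 2^28 bytes (256MB) and a 4-byte record gives 2^30 bytes (1GB).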
Bit layout of hotness record
----------------------------
Default mode
- Bits 0-1: Frequency (2bits, 4 access samples)
- Bits 2-6: Bucketed time (5bits, up to 4s with HZ=1000)
- Bit 7: Migration ready bit
Precision mode
- Bits 0-9: Target NID (10bits)
- Bits 10-12: Frequency (3bits, 8 access samples)
- Bits 13-26: Time (14bits, up to 16s with HZ=1000)
- Bits 27-30: Reserved
- Bit 31: Migration ready bit
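The two layouts above can be expressed as pack/unpack helpers. The sketch below is an assumption-laden illustration: the function names (def_pack(), prec_pack() and the accessors) are hypothetical and do not come from the patchset; only the bit positions are taken from the lists above.

```c
#include <assert.h>
#include <stdint.h>

/* Default mode (u8): bits 0-1 frequency, bits 2-6 bucketed time, bit 7 ready. */
static uint8_t def_pack(unsigned int freq, unsigned int time, unsigned int ready)
{
	assert(freq <= 0x3u && time <= 0x1Fu && ready <= 1u);
	return (uint8_t)(freq | (time << 2) | (ready << 7));
}

static unsigned int def_freq(uint8_t r)  { return r & 0x3u; }
static unsigned int def_time(uint8_t r)  { return (r >> 2) & 0x1Fu; }
static unsigned int def_ready(uint8_t r) { return (r >> 7) & 0x1u; }

/*
 * Precision mode (u32): bits 0-9 target NID, bits 10-12 frequency,
 * bits 13-26 time, bits 27-30 reserved (left zero), bit 31 ready.
 */
static uint32_t prec_pack(uint32_t nid, uint32_t freq, uint32_t time,
			  uint32_t ready)
{
	assert(nid <= 0x3FFu && freq <= 0x7u && time <= 0x3FFFu && ready <= 1u);
	return nid | (freq << 10) | (time << 13) | (ready << 31);
}

static uint32_t prec_nid(uint32_t r)   { return r & 0x3FFu; }
static uint32_t prec_freq(uint32_t r)  { return (r >> 10) & 0x7u; }
static uint32_t prec_time(uint32_t r)  { return (r >> 13) & 0x3FFFu; }
static uint32_t prec_ready(uint32_t r) { return (r >> 31) & 0x1u; }
```

Packing the maximum value of every field yields 0xFF in default mode and 0x87FFFFFF in precision mode; the four zero bits in the latter are the reserved bits 27-30.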
Potential hotness sources
-------------------------
1. NUMA Balancing (NUMAB2, Tiering mode) - included in this patchset.
2. AMD IBS Memory Profiler: HW based access profiler - included in this
patchset.
3. klruscand - PTE‑A bit scanning built on MGLRU’s walk helpers - was
showcased in previous versions but not part of this version.
4. folio_mark_accessed() - Page cache access tracking (unmapped
page cache pages) - was showcased in previous versions but not part
of this patchset.
Changes in v7
-------------
- Added AMD IBS Memory Profiler as page hotness source.
- Addressed review comments from v6 (Thanks to Shashiko AI, Gregory and Donet)
- Early exit from batched migration routine if input
list is empty
- Changed the name of batched migration routine to indicate
that it handles "promotion" of batched "memcg" folios.
- Debug code in batched migration routine to check if all
the folios in the input list belong to the same memcg.
- Kconfig dependency cleanups.
- Fix off-by-one regression in nid check in pghot-precise.
- More checks to validate nid in pghot-precise.
- Early check to not call kmigrated_run() for lower tier nodes.
- Handling PTE writable and ignore_writable conditions correctly
in hint fault handler.
- Using unsigned int instead of unsigned long for representing
time in ms.
- Misc cleanups.
Results
=======
Posted as replies to this mail thread.
This v7 patchset applies on top of upstream commit c1f49dea2b8f and
can be fetched from:
https://github.com/AMDESE/linux-mm/tree/bharata/pghot-v7
v6: https://lore.kernel.org/linux-mm/20260323095104.238982-1-bharata@amd.com/
v5: https://lore.kernel.org/linux-mm/20260129144043.231636-1-bharata@amd.com/
v4: https://lore.kernel.org/linux-mm/20251206101423.5004-1-bharata@amd.com/
v3: https://lore.kernel.org/linux-mm/20251110052343.208768-1-bharata@amd.com/
v2: https://lore.kernel.org/linux-mm/20250910144653.212066-1-bharata@amd.com/
v1: https://lore.kernel.org/linux-mm/20250814134826.154003-1-bharata@amd.com/
v0: https://lore.kernel.org/linux-mm/20250306054532.221138-1-bharata@amd.com/
Bharata B Rao (6):
mm: migrate: Allow misplaced migration without VMA
mm: Hot page tracking and promotion - pghot
mm: pghot: Precision mode for pghot
mm: sched: move NUMA balancing tiering promotion to pghot
x86/ibs: Move IBS caps definitions into its own header
x86/mm/ibs: In-kernel driver for AMD IBS Memory Profiler
Gregory Price (1):
mm: migrate: Add promote_misplaced_memcg_folios()
Documentation/admin-guide/mm/index.rst | 1 +
Documentation/admin-guide/mm/pghot.rst | 80 ++++
arch/x86/Kconfig | 16 +
arch/x86/include/asm/ibs-caps.h | 93 ++++
arch/x86/include/asm/ibs-mprof.h | 46 ++
arch/x86/include/asm/msr-index.h | 8 +
arch/x86/include/asm/perf_event.h | 81 +---
arch/x86/mm/Makefile | 1 +
arch/x86/mm/ibs-mprof.c | 308 ++++++++++++
include/linux/cpuhotplug.h | 1 +
include/linux/migrate.h | 9 +-
include/linux/mm.h | 35 +-
include/linux/mmzone.h | 24 +-
include/linux/pghot.h | 113 +++++
include/linux/vm_event_item.h | 11 +
init/Kconfig | 13 +
kernel/sched/core.c | 7 +
kernel/sched/debug.c | 1 -
kernel/sched/fair.c | 177 +------
kernel/sched/sched.h | 1 -
mm/Kconfig | 34 ++
mm/Makefile | 6 +
mm/huge_memory.c | 24 +-
mm/memcontrol.c | 6 +-
mm/memory-tiers.c | 15 +-
mm/memory.c | 28 +-
mm/mempolicy.c | 3 -
mm/migrate.c | 98 +++-
mm/mm_init.c | 10 +
mm/pghot-default.c | 79 +++
mm/pghot-precise.c | 81 ++++
mm/pghot-tunables.c | 182 +++++++
mm/pghot.c | 633 +++++++++++++++++++++++++
mm/vmstat.c | 13 +-
34 files changed, 1922 insertions(+), 316 deletions(-)
create mode 100644 Documentation/admin-guide/mm/pghot.rst
create mode 100644 arch/x86/include/asm/ibs-caps.h
create mode 100644 arch/x86/include/asm/ibs-mprof.h
create mode 100644 arch/x86/mm/ibs-mprof.c
create mode 100644 include/linux/pghot.h
create mode 100644 mm/pghot-default.c
create mode 100644 mm/pghot-precise.c
create mode 100644 mm/pghot-tunables.c
create mode 100644 mm/pghot.c
base-commit: c1f49dea2b8f335813d3b348fd39117fb8efb428
IBS Memory Profiler driver part of this patchset depends on the
patchset that increases the number of APIC EILVT registers -
https://lore.kernel.org/lkml/cover.1775019269.git.naveen@kernel.org/
--
2.34.1
On 04-May-26 11:39 AM, Bharata B Rao wrote:
> Results
> =======
> Posted as replies to this mail thread.
Graph500 benchmark results:
Test system details
-------------------
3 node AMD Zen5 system with 2 regular NUMA nodes (0, 1) and a CXL node (2)
$ numactl -H
available: 3 nodes (0-2)
node 0 cpus: 0-95,192-287
node 0 size: 128460 MB
node 1 cpus: 96-191,288-383
node 1 size: 128893 MB
node 2 cpus:
node 2 size: 257993 MB
node distances:
node 0 1 2
0: 10 32 50
1: 32 10 60
2: 255 255 10
Hotness sources
---------------
NUMAB0 - Without NUMA Balancing in base case and with no source enabled
in the pghot case. No migrations occur.
NUMAB2 - Existing hot page promotion for the base case and
use of hint faults as source in the pghot case.
NUMAB3 - Enabled both regular and tiering mode of NUMA Balancing
(kernel.numa_balancing=3)
Pghot by default promotes after two accesses but for NUMAB2 source,
promotion is done after one access to match the base behaviour.
(/sys/kernel/debug/pghot/freq_threshold=1)
Graph500 details
----------------
Command: mpirun -n 128 --bind-to core --map-by core
graph500/src/graph500_reference_bfs 28 16
After the graph creation, the processes are stopped and data is migrated
to CXL node 2 before continuing so that BFS phase starts accessing lower
tier memory.
Total memory usage is slightly over 100GB and will fit within Node 0 and 1.
Hence there is no memory pressure to induce demotions.
harmonic_mean_TEPS - Higher is better
=====================================================================================
                        Base            Base            pghot-default   pghot-precise
                        NUMAB0          NUMAB2          NUMAB2          NUMAB2
=====================================================================================
harmonic_mean_TEPS      5.08026e+08     7.48633e+08     5.46257e+08     7.45101e+08
mean_time               8.45413         5.73702         7.86245         5.76421
median_TEPS             5.09236e+08     7.25058e+08     5.40525e+08     7.63752e+08
max_TEPS                5.15244e+08     1.03391e+09     8.51317e+08     9.7552e+08
pgpromote_success       0               13809474        13763582        13763155
numa_pte_updates        0               26746117        39502157        36368086
numa_hint_faults        0               13811769        24248272        21172314
=====================================================================================
                        pghot-default
                        NUMAB3
=====================================================================================
harmonic_mean_TEPS      7.00515e+08
mean_time               6.13109
median_TEPS             7.06813e+08
max_TEPS                7.63164e+08
pgpromote_success       13762087
numa_pte_updates        93632490
numa_hint_faults        70566306
=====================================================================================
- The base case shows a good improvement with NUMAB2 in harmonic_mean_TEPS.
- The same improvement gets maintained with pghot-precise too.
- pghot-default mode doesn't show benefit even when achieving similar page promotion
numbers. This mode doesn't track accessing NID and by default promotes to NID=0
which probably isn't all that beneficial as processes are running on both Node 0
and Node 1.
- pghot-default recovers the performance when balancing between toptier nodes
0 and 1 is enabled in addition to hot page promotion.
On Tue, 5 May 2026 16:11:43 +0530 Bharata B Rao <bharata@amd.com> wrote:
> On 04-May-26 11:39 AM, Bharata B Rao wrote:
> > Results
> > =======
> > Posted as replies to this mail thread.
>
> Graph500 benchmark results:

Please include (and maintain) the testing results in the formal
changelogs (perhaps in the [0/N], in a condensed summary form).

I mean, the entire point of the whole patchset is to improve
performance (yes?), so this contribution lives or dies by its
performance testing results.

The first thing your audience will want to know is "how good is this
for our users". So tell us! Up front, within the first paragraphs!

The better the results, the more motivated people will be to help get
your work upstream.
On 09-May-26 6:48 AM, Andrew Morton wrote:
> On Tue, 5 May 2026 16:11:43 +0530 Bharata B Rao <bharata@amd.com> wrote:
>
>> On 04-May-26 11:39 AM, Bharata B Rao wrote:
>>> Results
>>> =======
>>> Posted as replies to this mail thread.
>>
>> Graph500 benchmark results:
>
> Please include (and maintain) the testing results in the formal
> changelogs (perhaps in the [0/N], in a condensed summary form).

The results and associated description were getting too long and hence I was
hesitating to make them part of 0/N. But as you say, I shall include a
condensed summary from next time.

>
> I mean, the entire point of the whole patchset is to improve
> performance (yes?), so this contribution lives or dies by its
> performance testing results.

The entire point of this patchset is not just about improving the performance.
It is mainly about adding a new dedicated infrastructure for detecting and
promoting hot pages. It is about having a subsystem that can act as a single
source of truth for page hotness in the kernel. Though we aren't there yet, we
have started by having a minimal infrastructure that centralizes the hot page
promotion and associated heuristics that currently sit in the scheduler so that
the same can be used with other page hotness sources as well.

The first source is the hint-fault based hot page promotion. Here the address
space scanning and introduction of hint faults still remain as before, but
the promotion engine is part of pghot. Hence the comparison against the base
numbers for this source is about meeting the current level of performance and
ensuring that the workloads don't suffer due to batched migration.

There are other sources as well, with the primary one being the IBS Memory
Profiler, which provides memory access information directly from the hardware.
I have some numbers for this source as well. Initial results look encouraging
and more tests can tell us if this source can be an independent one or
complements the existing one.

The earlier versions of this patchset had another source - PTE A bit based
scanning, where the idea was to completely replace the hint fault based
mechanism by PTE A bit based accesses, thereby taking both the detection and
promotion parts out of the process context. I have temporarily removed this
from this patchset for two reasons:
a) to simplify the patchset so that we can get some consensus on the
   infrastructure part first.
b) to explore the commonality with another PTE A bit scanning approach
   (called klruscand) that used MGLRU's scanning mechanism.

Also on the horizon is the use of hot page info that the CXL Hotness
Monitoring Unit (CHMU) can provide.

>
> The first thing your audience will want to know is "how good is this
> for our users". So tell us! Up front, within the first paragraphs!
>
> The better the results, the more motivated people will be to help get
> your work upstream.

So currently it is a multi-step approach, with the first step being to build a
common hotness infrastructure and move the existing mechanism to make use of it
w/o any regression. Then follow up with more sources.

Regards,
Bharata.
On Mon, May 11, 2026 at 04:07:16PM +0530, Bharata B Rao wrote:
>
> The entire point of this patchset is not just about improving the performance.
> It is mainly about adding a new dedicated infrastructure for detecting and
> promoting hot pages. It is about having a subsystem that can act as a single
> source of truth for page hotness in the kernel. Though we aren't there yet, we
> have started by having a minimal infrastructure that centralizes the hot page
> promotion and associated heuristics that currently sit in the scheduler so that
> the same can be used with other page hotness sources as well.
>

The goal of hotness tracking in general is to improve performance.

The goal of PGHot should be a reasonable baseline for the kernel to
course-correct LRU inversions across tiers over time, because LRU
threads only scan individual nodes and don't compare across nodes.

I would hazard against trying to wholesale state it "Shall be the single
source of truth", as we will inevitably discover some condition which is
not covered / cannot be captured / we will simply get it wrong.

Plus, intuitively, counter-balancing LRU/MGLRU aging is probably as
good as we can get without having to inject per-workload information
into the system - at which point the users should use DAMON.

~Gregory
On 11-May-26 8:08 PM, Gregory Price wrote:
> On Mon, May 11, 2026 at 04:07:16PM +0530, Bharata B Rao wrote:
>>
>> The entire point of this patchset is not just about improving the performance.
>> It is mainly about adding a new dedicated infrastructure for detecting and
>> promoting hot pages. It is about having a subsystem that can act as a single
>> source of truth for page hotness in the kernel. Though we aren't there yet, we
>> have started by having a minimal infrastructure that centralizes the hot page
>> promotion and associated heuristics that currently sit in the scheduler so that
>> the same can be used with other page hotness sources as well.
>>
>
> The goal of hotness tracking in general is to improve performance.

Agreed. As I have mentioned elsewhere in the thread, right now we have just
moved the existing promotion mechanism to pghot, hence the initial concern has
been to ensure the earlier performance levels are still met with a centralized
promotion engine that does batched promotions from non-process context.

>
> The goal of PGHot should be a reasonable baseline for the kernel to
> course-correct LRU inversions across tiers over time, because LRU
> threads only scan individual nodes and don't compare across nodes.

Right.

>
> I would hazard against trying to wholesale state it "Shall be the single
> source of truth", as we will inevitably discover some condition which is
> not covered / cannot be captured / we will simply get it wrong.

Yeah. The ideal goal of a single source of truth may be a bit far-fetched, but
pghot is definitely a subsystem that can work with multiple page hotness
sources, aggregate hot signals from them and provide a single unified promotion
mechanism.

Regards,
Bharata.
On Mon, May 04, 2026 at 11:39:17AM +0530, Bharata B Rao wrote:
> This is v7 of pghot, a hot-page tracking and promotion subsystem. The

I continue to think we should not do this.
On Mon, May 04, 2026 at 09:36:05PM +0100, Matthew Wilcox wrote:
> On Mon, May 04, 2026 at 11:39:17AM +0530, Bharata B Rao wrote:
> > This is v7 of pghot, a hot-page tracking and promotion subsystem. The
>
> I continue to think we should not do this.

My only pushback on the general "we should not do this" is that we need
something to counter-balance the demotion bit in vmscan.c, and the
current implementation (prot_none faults) is rather :[

I think this series needs to greatly limit its complexity and provide
some gentle correction for LRU inversions, and I think they're making a
decent attempt at that.

But then I think local memory expansion on CXL is going pretty
swimmingly in our datacenters :], others may not feel the same.

~Gregory
On 06-May-26 8:52 PM, Gregory Price wrote:
> On Mon, May 04, 2026 at 09:36:05PM +0100, Matthew Wilcox wrote:
>> On Mon, May 04, 2026 at 11:39:17AM +0530, Bharata B Rao wrote:
>>> This is v7 of pghot, a hot-page tracking and promotion subsystem. The
>>
>> I continue to think we should not do this.
>
> My only pushback on the general "we should not do this" is that we need
> something to counter-balance the demotion bit in vmscan.c, and the
> current implementation (prot_none faults) is rather :[

So you are saying pghot subsystem currently does hot page detection and
promotion only, which is fine. But the current implementation of demotion is not
very optimal and hence we should spend effort in fine-tuning demotion first?

In this series itself I have shown via benchmark numbers that for over-committed
cases (involving both demotion and promotion), the workload isn't really showing
real benefit due to demotion and promotion. Are you specifically referring to
this problem?

>
> I think this series needs to greatly limit its complexity and provide
> some gentle correction for LRU inversions, and I think they're making a
> decent attempt at that.

Regarding complexity, I agree that the initial version of this patchset was
quite complicated in the way it maintained hot page information. But the later
versions including this one have greatly reduced the complexity with one byte
of hot page information per PFN, atomic updates to hotness data without any
locks, per-lowertier kmigrated threads for promotion and reuse of existing hot
page promotion engine. Did you have anything else in mind wrt complexity?

Can you provide more context about the LRU inversion problem?

Regards,
Bharata.
On Mon, May 11, 2026 at 03:32:20PM +0530, Bharata B Rao wrote:
>
>
> On 06-May-26 8:52 PM, Gregory Price wrote:
> > On Mon, May 04, 2026 at 09:36:05PM +0100, Matthew Wilcox wrote:
> >> On Mon, May 04, 2026 at 11:39:17AM +0530, Bharata B Rao wrote:
> >>> This is v7 of pghot, a hot-page tracking and promotion subsystem. The
> >>
> >> I continue to think we should not do this.
> >
> > My only pushback on the general "we should not do this" is that we need
> > something to counter-balance the demotion bit in vmscan.c, and the
> > current implementation (prot_none faults) is rather :[
>
> So you are saying pghot subsystem currently does hot page detection and
> promotion only, which is fine. But the current implementation of demotion is not
> very optimal and hence we should spend effort in fine-tuning demotion first?
>
I'm saying because of demotion and fallbacks, we need a mechanism to
handle promotions. I'm not convinced a hotness tracker will extend to
coldness - at least any better than LRU/MGLRU.
> In this series itself I have shown via benchmark numbers that for over-committed
> cases (involving both demotion and promotion), the workload isn't really showing
> real benefit due to demotion and promotion. Are you specifically referring to
> this problem?
>
If over-committed means over-subscribed hot-tier (more hot memory than
available top tier memory), then yeah that result is intuitive. I
haven't pointed to any specific issue, as of yet, still taking time to
consider some of the results.
>
> Can you provide more context about the LRU inversion problem?
>
I've been tracking some data around shrink_folio_list and
alloc_migrate_folio behavior when a low tier node is full.
The result is we end up just swapping memory from high tier straight to
swap and skip demotion, resulting in a bunch of file and anon refaults.
Hardware: Single Socket, 768GB DRAM, 256GB CXL Expander
In this workload, we see swap usage after the full 1TB of memory is
utilized, and as a result we see swap spillage.
second_chance = second alloc attempt in alloc_migrate_folio succeeds
swap_fallback = second chance fails, we swap directly from top tier
Sample data:
pgdemote_kswapd 333052779
pgdemote_direct 3181480482
pgdemote_second_chance 31017629
pgdemote_swap_fallback 335759535
workingset_refault_anon 30106868
workingset_refault_file 2343035341
(note here: swap fallback is the number of occurrences, while the others are
numbers of pages. As a result, the actual number of swapped pages is
likely much closer to the pgdemote_direct number)
As a result: LRU is just broken on CXL systems, LRU inverts by design.
In a sane world we would just see the second tier as an extension of the
LRU, but that doesn't necessarily mean we can glean hotness data from it
(it's still largely a coldness tracking mechanism).
I have patches I haven't RFC'd yet that try to address this, but I need
more time to test it.
I don't think this is something to address with PGHot.
---
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 112983b42559..ccdd698c5937 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1043,7 +1043,10 @@ struct folio *alloc_migrate_folio(struct folio *src, unsigned long private)
mtc->gfp_mask &= ~__GFP_THISNODE;
mtc->nmask = allowed_mask;
- return alloc_migration_target(src, (unsigned long)mtc);
+ dst = alloc_migration_target(src, (unsigned long)mtc);
+ if (dst)
+ count_vm_events(PGDEMOTE_SECOND_CHANCE, folio_nr_pages(src));
+ return dst;
}
/*
@@ -1616,6 +1619,8 @@ static unsigned int shrink_folio_list(struct list_head *folio_list,
/* Folios that could not be demoted are still in @demote_folios */
if (!list_empty(&demote_folios)) {
/* Folios which weren't demoted go back on @folio_list */
+ if (!sc->proactive)
+ count_vm_event(PGDEMOTE_SWAP_FALLBACK);
list_splice_init(&demote_folios, folio_list);
/*
On 5/5/26 06:36, Matthew Wilcox wrote:
> On Mon, May 04, 2026 at 11:39:17AM +0530, Bharata B Rao wrote:
>> This is v7 of pghot, a hot-page tracking and promotion subsystem. The
>
> I continue to think we should not do this.
>

I am unclear about the benefits of the patchset, I have not tested
it or reviewed the latest revision. My big concern was that top-tier
might not always be suitable.

I see that there are some numbers posted, but I find this weird
"After the graph creation, the processes are stopped and data is migrated
to CXL node 2 before continuing so that BFS phase starts accessing lower
tier memory." Why not allocate everything on CXL node 2?

Balbir
On 06-May-26 3:47 AM, Balbir Singh wrote:
> On 5/5/26 06:36, Matthew Wilcox wrote:
>> On Mon, May 04, 2026 at 11:39:17AM +0530, Bharata B Rao wrote:
>>> This is v7 of pghot, a hot-page tracking and promotion subsystem. The
>>
>> I continue to think we should not do this.
>>
>
> I am unclear about the benefits of the patchset, I have not tested
> it or reviewed the latest revision. My big concern was that top-tier
> might not always be suitable.

So you are saying that we should have a capability to promote accessed pages
from lower tier to another tier that is not classified as top tier? Is that
non-top tier node the one which generates accesses?

>
> I see that there are some numbers posted, but I find this weird
> "After the graph creation, the processes are stopped and data is migrated
> to CXL node 2 before continuing so that BFS phase starts accessing lower
> tier memory." Why not allocate everything on CXL node 2?

In the ideal scenario, the benefit is to see if any pages that land up on lower
tier get identified as hot and get promoted. That means we need to create an
over-committed scenario where the pages get demoted first. I have provided
numbers from such cases in my previous versions. The problem with this case is
that the base hot page promotion (NUMAB2) hasn't shown any benefit at all with
my micro-benchmark - Ref:
https://lore.kernel.org/linux-mm/868004d8-bb8e-4800-9fdd-ade48e95fe3b@amd.com/

Same has been observed with redis-memtier benchmark -
https://lore.kernel.org/linux-mm/957f2242-56d4-4bf0-8aeb-9d60fbea8c8c@amd.com/

Instead what I am doing here is to take out demotion from the scenario but still
retain the access pattern of the benchmark by pushing out the data to lower tier
when the benchmark reaches steady allocation state.

Regards,
Bharata.
On 5/6/26 13:43, Bharata B Rao wrote:
> On 06-May-26 3:47 AM, Balbir Singh wrote:
>> On 5/5/26 06:36, Matthew Wilcox wrote:
>>> On Mon, May 04, 2026 at 11:39:17AM +0530, Bharata B Rao wrote:
>>>> This is v7 of pghot, a hot-page tracking and promotion subsystem. The
>>>
>>> I continue to think we should not do this.
>>>
>>
>> I am unclear about the benefits of the patchset, I have not tested
>> it or reviewed the latest revision. My big concern was that top-tier
>> might not always be suitable.
>
> So you are saying that we should have a capability to promote accessed pages
> from lower tier to another tier that is not classified as top tier? Is that
> non-top tier node the one which generates accesses?
>

Yes, a top tier node could be CPU-less for example.

>>
>> I see that there are some numbers posted, but I find this weird
>> "After the graph creation, the processes are stopped and data is migrated
>> to CXL node 2 before continuing so that BFS phase starts accessing lower
>> tier memory." Why not allocate everything on CXL node 2?
>
> In the ideal scenario, the benefit is to see if any pages that land up on lower
> tier get identified as hot and get promoted. That means we need to create an
> over-committed scenario where the pages get demoted first. I have provided

Why do the pages need to get demoted? Why not allocate them from the lower tier
to show that promotion upwards is helpful?

> numbers from such cases in my previous versions. The problem with this case is
> that the base hot page promotion (NUMAB2) hasn't shown any benefit at all with
> my micro-benchmark - Ref:
> https://lore.kernel.org/linux-mm/868004d8-bb8e-4800-9fdd-ade48e95fe3b@amd.com/
>
> Same has been observed with redis-memtier benchmark -
> https://lore.kernel.org/linux-mm/957f2242-56d4-4bf0-8aeb-9d60fbea8c8c@amd.com/
>
> Instead what I am doing here is to take out demotion from the scenario but still
> retain the access pattern of the benchmark by pushing out the data to lower tier
> when the benchmark reaches steady allocation state.
>

Balbir
On 06-May-26 9:32 AM, Balbir Singh wrote:
>>> I am unclear about the benefits of the patchset, I have not tested
>>> it or reviewed the latest revision. My big concern was that top-tier
>>> might not always be suitable.
>>
>> So you are saying that we should have a capability to promote accessed pages
>> from lower tier to another tier that is not classified as top tier? Is that
>> non-top tier node the one which generates accesses?
>>
>
> Yes, a top tier node could be CPU-less for example.

Currently kmigrated thread in pghot doesn't explicitly prevent promotion to
non-toptier nodes. Here is how this works for the two modes of operation in
pghot:

pghot-default: In this mode, the target NID isn't explicitly tracked and hence
kmigrated relies on the user-configurable pghot_target_nid. Though there is a
!node_is_toptier(nid) check in the helper routine that populates
pghot_target_nid, that can be relaxed if required.

pghot-precise: In this mode, the accessing CPU's node is tracked as the target
nid and promotion is done to that node. Note that pghot_target_nid isn't used
here.

Hence I don't see any major issues in this patchset to cover your use case. Let
me know if I miss anything here. BTW, does the existing hot page promotion cover
the use case you are targeting?

>
>>>
>>> I see that there are some numbers posted, but I find this weird
>>> "After the graph creation, the processes are stopped and data is migrated
>>> to CXL node 2 before continuing so that BFS phase starts accessing lower
>>> tier memory." Why not allocate everything on CXL node 2?
>>
>> In the ideal scenario, the benefit is to see if any pages that land up on lower
>> tier get identified as hot and get promoted. That means we need to create an
>> over-committed scenario where the pages get demoted first. I have provided
>
> Why do the pages need to get demoted? Why not allocate them from the lower tier
> to show that promotion upwards is helpful?

As you can see, these are controlled experiments to measure the effectiveness
of hot page detection and promotion and the benefits from promotion. It can be
done in the way you are suggesting; just that I found it a bit simpler to pause
the benchmark, migrate all pages to lower tier memory before the benchmark
starts accessing them rather than relying on setting memory policies to achieve
the same effect.

Regards,
Bharata.
On 04-May-26 11:39 AM, Bharata B Rao wrote:
>
> Results
> =======
> Posted as replies to this mail thread.
Micro-benchmark numbers for IBS Memory Profiler pghot source
Test system details
-------------------
2 node AMD system with 1 regular NUMA node (0) and a CXL node (1)
$ numactl -H
available: 2 nodes (0-1)
node 0 cpus: 0-255
node 0 size: 515563 MB
node 1 cpus:
node 1 size: 258034 MB
node distances:
node     0    1
   0:   10   50
   1:  255   10
Hotness sources
---------------
NUMAB0 - Without NUMA Balancing in base case and with no source enabled
in the patched case. No migrations occur.
NUMAB2 - Existing hot page promotion for the base case
HWHINTS - IBS Memory Profiler as source for pghot
Pghot by default promotes after two accesses but for NUMAB2 source,
promotion is done after one access to match the base behaviour.
(/sys/kernel/debug/pghot/freq_threshold=1)
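The effect of freq_threshold can be illustrated with a minimal sketch. The struct and function names here are hypothetical, and the model assumes a plain per-page access counter; it ignores any time window or other state the real implementation may keep.

```c
#include <stdbool.h>

/* Hypothetical per-page record; the real pghot layout may differ. */
struct hot_record {
    unsigned int nr_accesses;
};

/*
 * Record one access and report whether the page now qualifies for
 * promotion, i.e. its access count has reached freq_threshold.
 */
bool record_access(struct hot_record *rec, unsigned int freq_threshold)
{
    rec->nr_accesses++;
    return rec->nr_accesses >= freq_threshold;
}
```

With freq_threshold=2 (the stated default) the first access only records the page and the second one makes it a promotion candidate; with freq_threshold=1 the first access already does.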
==============================================================
Scenario 1 - Enough memory in toptier and hence only promotion
==============================================================
Multi-threaded application with 64 threads that access memory (8G) at
4K granularity repeatedly and randomly. The number of accesses per
thread and the randomness pattern for each thread are fixed beforehand.
The accesses are divided into stores and loads in the ratio of 50:50.
Benchmark threads run on Node 0, while memory is initially provisioned on
CXL node 1 before the accesses start.
Repetitive accesses result in lower-tier pages becoming hot, which kmigrated
then detects and migrates. The benchmark score is the time taken to
finish the accesses in microseconds. The sooner it finishes the better it is.
All the numbers shown below are averages of 3 runs.
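The per-thread access loop of such a micro-benchmark could look roughly like the sketch below. This is not the benchmark source: the function name, the xorshift PRNG and the choice of alternating stores and loads to get the 50:50 split are all assumptions made for illustration.

```c
#include <stddef.h>
#include <stdint.h>

#define PAGE_SZ 4096u

/* xorshift64: small deterministic PRNG, so a seed fixes the whole pattern. */
static uint64_t xorshift64(uint64_t *state)
{
    uint64_t x = *state;

    x ^= x << 13;
    x ^= x >> 7;
    x ^= x << 17;
    return *state = x;
}

/*
 * Touch one byte in a randomly chosen 4K page, naccesses times.
 * Even-numbered accesses are stores and odd-numbered ones are loads
 * (assumed way of getting a 50:50 split). Returns the number of stores.
 */
size_t run_accesses(volatile unsigned char *buf, size_t npages,
                    size_t naccesses, uint64_t seed, unsigned char *sink)
{
    uint64_t state = seed ? seed : 1; /* xorshift state must be non-zero */
    size_t stores = 0;
    unsigned char acc = 0;

    for (size_t i = 0; i < naccesses; i++) {
        size_t page = (size_t)(xorshift64(&state) % npages);
        volatile unsigned char *p = buf + page * PAGE_SZ;

        if (i % 2 == 0) {
            *p = (unsigned char)i;  /* store */
            stores++;
        } else {
            acc ^= *p;              /* load */
        }
    }
    *sink = acc; /* keep the loaded values observable */
    return stores;
}
```

In a setup like the one described, each of the 64 threads would run this loop with its own seed over the same 8G region, which at 4K granularity is 2,097,152 pages.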
Time taken (microseconds, lower is better)
---------------------------------------------------------
Source          Base            Pghot-default
---------------------------------------------------------
NUMAB0          181,393,365     184,331,381
NUMAB2           42,287,528
HWHINTS         NA               50,422,862
---------------------------------------------------------
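To put the times above in relative terms, a trivial helper like the following can be used (the function name is illustrative only):

```c
/* Speedup of a run relative to a reference time: > 1 means the run was faster. */
double speedup(double reference_usec, double run_usec)
{
    return reference_usec / run_usec;
}
```

Applied to the table, base NUMAB2 is roughly 4.29x faster than base NUMAB0, pghot-default with HWHINTS is roughly 3.66x faster than pghot-default NUMAB0, and the HWHINTS run takes about 19% longer than base NUMAB2.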
Stats comparison between base-NUMAB2 and pghot-default-hwhints
---------------------------------------------------------------------
                            Base-NUMAB2     Pghot-default-hwhints
---------------------------------------------------------------------
pgpromote_success           2097152         1961087
numa_hint_faults            2358069         0
pghot_recorded_accesses     NA              1962696
pghot_recorded_hintfaults   NA              0
pghot_recorded_hwhints      NA              5532979
hwhint_total_events         NA              5532979
---------------------------------------------------------------------
On 04-May-26 11:39 AM, Bharata B Rao wrote:
> Results
> =======
> Posted as replies to this mail thread.
Initial Graph500 benchmark numbers for IBS Memory Profiler source:
Test system details
-------------------
3 node AMD system with 2 regular NUMA nodes (0, 1) in NPS2 mode and a CXL node (2)
$ numactl -H
available: 3 nodes (0-2)
node 0 cpus: 0-63,128-191
node 0 size: 257715 MB
node 1 cpus: 64-127,192-255
node 1 size: 257845 MB
node 2 cpus:
node 2 size: 258032 MB
node distances:
node     0    1    2
   0:   10   12   50
   1:   12   10   50
   2:  255  255   10
Hotness sources
---------------
NUMAB0 - Without NUMA Balancing in base case and with no source enabled
in the pghot case. No migrations occur.
NUMAB2 - Existing hot page promotion for the base case and
use of hint faults as source in the pghot case.
HWHINTS - IBS Memory Profiler as source for pghot
Pghot by default promotes after two accesses but for NUMAB2 and HWHINTS
sources, promotion is done after one access to match the base behaviour.
(/sys/kernel/debug/pghot/freq_threshold=1)
Graph500 details
----------------
Command: mpirun -n 128 --bind-to core --map-by core
graph500/src/graph500_reference_bfs 28 16
After the graph creation, the processes are stopped and data is migrated
to CXL node 2 before continuing so that BFS phase starts accessing lower
tier memory.
Total memory usage is slightly over 100GB and will fit within Node 0 and 1.
Hence there is no memory pressure to induce demotions.
harmonic_mean_TEPS - Higher is better
=============================================================================
                            Base            Base            pghot-default
                            NUMAB0          NUMAB2          NUMAB2
=============================================================================
harmonic_mean_TEPS          4.09614e+08     1.28401e+09     1.47926e+09
mean_time                   10.4853         3.34492         2.90342
median_TEPS                 4.10086e+08     1.44584e+09     1.85957e+09
max_TEPS                    4.1661e+08      1.79773e+09     1.99242e+09
pgpromote_success           0               13746029        13412213
numa_hint_faults            0               13753808        26669823
pghot_recorded_accesses     NA              NA              26669551
pghot_recorded_hintfaults   NA              NA              26669823
pghot_recorded_hwhints      NA              NA              0
hwhint_total_events         NA              NA              0
=============================================================================
                            pghot-default
                            HWHINTS
=============================================================================
harmonic_mean_TEPS          1.52334e+09
mean_time                   2.81941
median_TEPS                 1.57446e+09
max_TEPS                    1.72014e+09
pgpromote_success           3415599
numa_hint_faults            0
pghot_recorded_accesses     3440912
pghot_recorded_hintfaults   0
pghot_recorded_hwhints      24475210
hwhint_total_events         24475244
=============================================================================
Having no migration at all (NUMAB0) hurts Graph500. HWHINTS with pghot
provides benchmark numbers similar to base NUMAB2 (harmonic_mean_TEPS of
1.52334e+09 vs 1.28401e+09) even though it promotes far fewer pages
(pgpromote_success of 3415599 vs 13746029).
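For readers unfamiliar with the headline metric: Graph500 reports TEPS (traversed edges per second) per BFS run and summarises the runs with a harmonic mean, which is the appropriate average for rates. A minimal sketch of that calculation, with an illustrative function name that is not taken from the Graph500 sources:

```c
#include <stddef.h>

/* Harmonic mean of n positive rates: n / sum(1 / x_i). Returns 0 for n == 0. */
double harmonic_mean(const double *rates, size_t n)
{
    double inv_sum = 0.0;

    for (size_t i = 0; i < n; i++)
        inv_sum += 1.0 / rates[i];
    return n ? (double)n / inv_sum : 0.0;
}
```

Because slow runs dominate a harmonic mean, it sits below the median whenever a few BFS runs are much slower than the rest, which is consistent with harmonic_mean_TEPS being lower than median_TEPS in every column above.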