[RFC PATCH v1 0/7] A subsystem for hot page detection and promotion
Posted by Bharata B Rao 1 month, 3 weeks ago
Hi,

This patchset adds a dedicated sub-system for maintaining
hot page information for the lower tiers and promoting the hot pages
to the top tiers. It exposes an API that other sub-systems which detect
accesses can use to report those accesses for further processing. Further
processing includes system-wide accumulation of memory access info at
PFN granularity, classification of PFNs as hot, and promotion of hot
pages using per-node kernel threads. This is a continuation of the
earlier kpromoted work [1] that I posted a while back.

Kernel thread based async batch migration [2] was an offshoot of
this effort that attempted to batch the migrations from NUMA
balancing by creating a separate kernel thread for migration.
Per-page hotness information was stored as part of extended page
flags. The kernel thread then scanned the entire PFN space to pick
the PFNs that were classified as hot.

The challenges observed with the previous approaches were:

1. Too many PFNs need to be scanned to identify the hot PFNs in
   approach [2].
2. Hot page records stored in hash lists become unwieldy for
   extracting the required hot pages in approach [1].
3. Dynamic allocation vs static availability of space to store
   per-page hotness information.

This series tries to address challenges 1 and 2 by maintaining
the hot page records in hash lists for quick lookup, and by
maintaining a separate per-target-node max heap for the
ready-to-migrate hot page records. The records in the heap are
priority-ordered based on the "hotness" of the page.
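
To make the arrangement concrete, a hot page record along the
following lines sits on the hash list and, once hot enough, also in
the per-target-node heap (a minimal sketch; the field names are
illustrative assumptions, not the actual include/linux/pghot.h
definitions):

struct pghot_record {
	unsigned long pfn;		/* PFN being tracked */
	int nid;			/* promotion target node */
	unsigned int frequency;		/* number of reported accesses */
	unsigned long last_access;	/* time of most recent access */
	struct hlist_node hnode;	/* hash list linkage, keyed by PFN */
	size_t heap_idx;		/* slot in the per-target-node
					 * max heap, once the hotness
					 * crosses the threshold */
};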

The API for reporting page accesses remains unchanged from [1].
When a page access gets recorded, the hotness data of the page is
updated and, if it crosses a threshold, the page gets tracked in the
heap as well. These heaps are per-target-node, and the corresponding
migrate threads periodically extract the top records from them and
do batch migration.
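
Schematically, the reporting path does something like this (a sketch
only; the signature and the helper names below are assumptions, not
the actual mm/pghot.c code):

void pghot_record_access(unsigned long pfn, int nid, int src,
			 unsigned long now)
{
	struct pghot_record *rec;

	/* Find the record for this PFN or allocate a new one. */
	rec = pghot_lookup_or_alloc(pfn, nid);
	if (!rec)
		return;

	rec->frequency++;
	rec->last_access = now;

	/*
	 * Crossed the hotness threshold: track the record in the
	 * target node's max heap so that the corresponding migrate
	 * thread can find it.
	 */
	if (rec->frequency >= PGHOT_FREQ_THRESHOLD)
		pghot_heap_add_or_update(nid, rec);
}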

In the current series, two page temperature sources are included
as examples:

1. IBS-based memory access profiler.
2. PTE-A-bit-based access profiler for MGLRU (from Kinsey Ho).

TODOs:

- Currently only access frequency is used to calculate the hotness.
  We could have a scalar hotness indicator based on both frequency
  of access and time of access.
- There could be millions of allocations and freeings of records,
  some from atomic contexts too. Need to understand how problematic
  this could be. Approach [2] mitigated this by having pre-allocated
  hotness records for each page as part of extended page flags.
- The amount of data needed for tracking hotness is also a concern.
  There is scope for packing the three parameters (nid, time, frequency)
  more compactly, which I will attempt in future iterations (see the
  packing sketch after this list).
- Migration rate-limiting needs to be added.
- Only very lightly tested at the moment, as the current focus is to
  get the hot data arrangement right.
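
For instance, the three parameters could be packed into a single
unsigned long along these lines (the field widths are illustrative
placeholders, not what a future iteration will necessarily use):

#define HOT_NID_BITS	10	/* up to 1024 nodes */
#define HOT_FREQ_BITS	8	/* saturating access count */
#define HOT_TIME_BITS	(BITS_PER_LONG - HOT_NID_BITS - HOT_FREQ_BITS)

static inline unsigned long hot_pack(unsigned int nid, unsigned long time,
				     unsigned int freq)
{
	/* Saturate the frequency instead of letting it wrap. */
	freq = min(freq, (1U << HOT_FREQ_BITS) - 1);

	return ((unsigned long)nid << (HOT_TIME_BITS + HOT_FREQ_BITS)) |
	       ((time & ((1UL << HOT_TIME_BITS) - 1)) << HOT_FREQ_BITS) |
	       freq;
}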

Regards,
Bharata.

[1] Kpromoted - https://lore.kernel.org/linux-mm/20250306054532.221138-1-bharata@amd.com/
[2] Kmigrated - https://lore.kernel.org/linux-mm/20250616133931.206626-1-bharata@amd.com/

Bharata B Rao (4):
  mm: migrate: Allow misplaced migration without VMA too
  mm: Hot page tracking and promotion
  x86: ibs: In-kernel IBS driver for memory access profiling
  x86: ibs: Enable IBS profiling for memory accesses

Gregory Price (1):
  migrate: implement migrate_misplaced_folios_batch

Kinsey Ho (2):
  mm: mglru: generalize page table walk
  mm: klruscand: use mglru scanning for page promotion

 arch/x86/events/amd/ibs.c           |  11 +
 arch/x86/include/asm/entry-common.h |   3 +
 arch/x86/include/asm/hardirq.h      |   2 +
 arch/x86/include/asm/ibs.h          |   9 +
 arch/x86/include/asm/msr-index.h    |  16 +
 arch/x86/mm/Makefile                |   3 +-
 arch/x86/mm/ibs.c                   | 343 +++++++++++++++++++
 include/linux/migrate.h             |   6 +
 include/linux/mmzone.h              |  16 +
 include/linux/pghot.h               |  87 +++++
 include/linux/vm_event_item.h       |  26 ++
 mm/Kconfig                          |  19 ++
 mm/Makefile                         |   2 +
 mm/internal.h                       |   4 +
 mm/klruscand.c                      | 118 +++++++
 mm/migrate.c                        |  36 +-
 mm/mm_init.c                        |  10 +
 mm/pghot.c                          | 501 ++++++++++++++++++++++++++++
 mm/vmscan.c                         | 176 +++++++---
 mm/vmstat.c                         |  26 ++
 20 files changed, 1365 insertions(+), 49 deletions(-)
 create mode 100644 arch/x86/include/asm/ibs.h
 create mode 100644 arch/x86/mm/ibs.c
 create mode 100644 include/linux/pghot.h
 create mode 100644 mm/klruscand.c
 create mode 100644 mm/pghot.c

-- 
2.34.1
Re: [RFC PATCH v1 0/7] A subsystem for hot page detection and promotion
Posted by Balbir Singh 1 month, 2 weeks ago
On 8/14/25 23:48, Bharata B Rao wrote:
> Hi,
> 
> This patchset adds a dedicated sub-system for maintaining
> hot page information for the lower tiers and promoting the hot pages
> to the top tiers. It exposes an API that other sub-systems which detect
> accesses can use to report those accesses for further processing. Further
> processing includes system-wide accumulation of memory access info at
> PFN granularity, classification of PFNs as hot, and promotion of hot
> pages using per-node kernel threads. This is a continuation of the
> earlier kpromoted work [1] that I posted a while back.
> 
> Kernel thread based async batch migration [2] was an offshoot of
> this effort that attempted to batch the migrations from NUMA
> balancing by creating a separate kernel thread for migration.
> Per-page hotness information was stored as part of extended page
> flags. The kernel thread then scanned the entire PFN space to pick
> the PFNs that were classified as hot.
> 
> The challenges observed with the previous approaches were:
> 
> 1. Too many PFNs need to be scanned to identify the hot PFNs in
>    approach [2].
> 2. Hot page records stored in hash lists become unwieldy for
>    extracting the required hot pages in approach [1].
> 3. Dynamic allocation vs static availability of space to store
>    per-page hotness information.
> 
> This series tries to address challenges 1 and 2 by maintaining
> the hot page records in hash lists for quick lookup, and by
> maintaining a separate per-target-node max heap for the
> ready-to-migrate hot page records. The records in the heap are
> priority-ordered based on the "hotness" of the page.
> 

Could you elaborate on when/how a page is considered hot? Is it based
on how often a page has been scanned?

> The API for reporting page accesses remains unchanged from [1].
> When a page access gets recorded, the hotness data of the page is
> updated and, if it crosses a threshold, the page gets tracked in the
> heap as well. These heaps are per-target-node, and the corresponding
> migrate threads periodically extract the top records from them and
> do batch migration.
> 

I don't quite follow the heaps and the tracking in the heap; could
you please clarify?

> In the current series, two page temperature sources are included
> as examples:
> 
> 1. IBS-based memory access profiler.
> 2. PTE-A-bit-based access profiler for MGLRU (from Kinsey Ho).
> 

Thanks,
Balbir
Re: [RFC PATCH v1 0/7] A subsystem for hot page detection and promotion
Posted by Bharata B Rao 1 month, 2 weeks ago
On 15-Aug-25 5:29 PM, Balbir Singh wrote:
> On 8/14/25 23:48, Bharata B Rao wrote:
>> Hi,
>>
>> This patchset adds a dedicated sub-system for maintaining
>> hot page information for the lower tiers and promoting the hot pages
>> to the top tiers. It exposes an API that other sub-systems which detect
>> accesses can use to report those accesses for further processing. Further
>> processing includes system-wide accumulation of memory access info at
>> PFN granularity, classification of PFNs as hot, and promotion of hot
>> pages using per-node kernel threads. This is a continuation of the
>> earlier kpromoted work [1] that I posted a while back.
>>
>> Kernel thread based async batch migration [2] was an offshoot of
>> this effort that attempted to batch the migrations from NUMA
>> balancing by creating a separate kernel thread for migration.
>> Per-page hotness information was stored as part of extended page
>> flags. The kernel thread then scanned the entire PFN space to pick
>> the PFNs that were classified as hot.
>>
>> The challenges observed with the previous approaches were:
>>
>> 1. Too many PFNs need to be scanned to identify the hot PFNs in
>>    approach [2].
>> 2. Hot page records stored in hash lists become unwieldy for
>>    extracting the required hot pages in approach [1].
>> 3. Dynamic allocation vs static availability of space to store
>>    per-page hotness information.
>>
>> This series tries to address challenges 1 and 2 by maintaining
>> the hot page records in hash lists for quick lookup, and by
>> maintaining a separate per-target-node max heap for the
>> ready-to-migrate hot page records. The records in the heap are
>> priority-ordered based on the "hotness" of the page.
>>
> 
> Could you elaborate on when/how a page is considered hot? Is it based
> on how often a page has been scanned?

There are multiple sub-systems within the kernel which detect and
act upon page accesses; NUMA balancing (via hint faults) and MGLRU
(via page table scanning for the PTE Accessed bit) are examples. The
idea behind this patchset is to consolidate such access information
within a new dedicated sub-system for hot page promotion that
maintains hotness data for accessed pages and promotes them when
a threshold is reached.

Currently I am considering only the number of accesses as an
indicator of page hotness. We need to consider the time of access
too; both should contribute to the eventual "hotness" indicator.
Maybe something analogous to how memory tiering derives an abstract
distance (adistance) value from bandwidth and latency could be
tried out.
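
For example (purely illustrative; the epoch length and the
shift-based weighting below are made-up placeholders), the two could
be folded into one scalar by ageing the access count:

static unsigned long hotness(unsigned int freq, unsigned long last_access)
{
	unsigned long age = jiffies - last_access;
	unsigned long shift = min(age / HOT_EPOCH_JIFFIES,
				  (unsigned long)(BITS_PER_LONG - 1));

	/*
	 * Halve the contribution of the access count for every
	 * HOT_EPOCH_JIFFIES of inactivity, so that a frequently
	 * but not recently accessed page cools down over time.
	 */
	return (unsigned long)freq >> shift;
}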

> 
>> The API for reporting page accesses remains unchanged from [1].
>> When a page access gets recorded, the hotness data of the page is
>> updated and, if it crosses a threshold, the page gets tracked in the
>> heap as well. These heaps are per-target-node, and the corresponding
>> migrate threads periodically extract the top records from them and
>> do batch migration.
>>
> 
> I don't quite follow the heaps and the tracking in the heap; could
> you please clarify?

When different sub-systems report page accesses via the API
introduced by this new sub-system, a record for each such page
is stored in hash lists (hashed by PFN value). In addition to
the PFN and target_nid, the hotness record includes parameters
like frequency and time of access from which the hotness is
derived. Repeated reports of accesses to the same PFN update the
hotness information of its record. When the hotness of a record
(as updated while recording an access) crosses a threshold, the
record becomes part of a max heap data structure. Records in the
max heap are arranged based on hotness, and hence the top elements
of the heap correspond to the hottest pages. There is one such heap
for each toptier node so that each per-toptier-node kpromoted thread
can easily extract the top N records from its own heap and perform
batched migration.
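
Roughly, the extraction side could look like this (an illustrative
sketch assuming the kernel's generic heap helpers from
include/linux/min_heap.h, where an inverted "less" callback yields a
max heap; none of the names below are the actual ones from this
series):

DEFINE_MIN_HEAP(struct pghot_record *, pghot_heap);

static bool hotter(const void *a, const void *b, void *args)
{
	struct pghot_record * const *ra = a, * const *rb = b;

	/* Inverted comparison turns the min heap into a max heap. */
	return (*ra)->frequency > (*rb)->frequency;
}

static void pghot_swp(void *a, void *b, void *args)
{
	swap(*(struct pghot_record **)a, *(struct pghot_record **)b);
}

static const struct min_heap_callbacks pghot_heap_cb = {
	.less	= hotter,
	.swp	= pghot_swp,
};

/* kpromoted thread: pop the N hottest records and batch-migrate them. */
static void promote_top_n(struct pghot_heap *heap, int nid, int n)
{
	while (n-- && heap->nr) {
		struct pghot_record *rec;

		rec = *(struct pghot_record **)min_heap_peek(heap);
		min_heap_pop(heap, &pghot_heap_cb, NULL);
		/* ... add rec->pfn to the migration batch for nid ... */
	}
}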

Hope this clarifies.

Regards,
Bharata.