[RFC PATCH v6 0/5] mm: Hot page tracking and promotion infrastructure

Posted by Bharata B Rao 1 week, 3 days ago
Hi,

This is v6 of pghot, a hot-page tracking and promotion subsystem. The
main changes in this version are the retention of hint faults as the
only hotness source, along with a number of cleanups, fixes and
restructuring.

This patchset introduces a new subsystem for hot page tracking and
promotion (pghot) with the following goals:

- Unify hot page detection from multiple sources like hint faults,
  page table scans, hardware hints (AMD IBS).
- Decouple detection from migration.
- Centralize promotion logic via per-lower-tier-node kmigrated kernel
  thread.
- Move promotion rate‑limiting and related logic used by
  numa_balancing=2 (NUMAB2, the current NUMA balancing–based promotion)
  from the scheduler to pghot for broader reuse.
  
Currently, multiple kernel subsystems detect page accesses independently.
This patchset consolidates accesses from these mechanisms by providing:

- A common API for reporting page accesses.
- Shared infrastructure for tracking hotness at PFN granularity.
- Per-lower-tier-node kernel threads for promoting pages.
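
None of the API details appear in this cover letter; purely to illustrate
the reporting-API idea (all names and fields below are hypothetical, not
the actual include/linux/pghot.h interface), every source would funnel its
observations into one per-PFN table:

```c
#include <stdint.h>

/* Hypothetical per-PFN hotness record; the real record is a packed
 * u8/u32 (see the bit layouts below), but an open-coded struct shows
 * the idea. */
struct hot_rec {
    uint32_t freq;          /* accesses seen so far */
    uint64_t last_access;   /* time of the most recent access */
    int      nid;           /* node the last access came from */
};

#define NR_PFNS 1024
static struct hot_rec hot_map[NR_PFNS];     /* indexed by PFN */

/* Common entry point: any source (hint fault, PTE-A scan, hardware
 * sample) reports an access the same way. */
static void pghot_record_access(uint64_t pfn, int nid, uint64_t now)
{
    struct hot_rec *rec = &hot_map[pfn % NR_PFNS];

    rec->freq++;
    rec->last_access = now;
    rec->nid = nid;
}
```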

Here is a brief summary of how this subsystem works:

- Tracks access frequency and last access time.
- In precision mode, the accessing NUMA node ID (NID) of each recorded
  access is also tracked.
- These hotness parameters are maintained in a per-PFN hotness record
  within the existing mem_section data structure.
  - In default mode, one byte (u8) is used for the hotness record. 5 bits
    store the time, and a bucketing scheme is used to represent a total
    access time of up to 4s with HZ=1000. The default toptier NID (0) is
    used as the promotion target, which can be changed via a debugfs
    tunable.
  - In precision mode, 4 bytes (u32) are used for each hotness record.
    14 bits store the time, which can represent around 16s with HZ=1000.
- Classifies pages as hot based on configurable thresholds.
- Pages classified as hot are marked as ready for migration using the
  ready bit. Both modes use the MSB of the hotness record as the ready
  bit.
- Per-lower-tier-node kmigrated threads periodically scan the PFNs of
  lower-tier nodes, checking the migration-ready bit to perform batched
  migrations. The interval between successive scans and the batch size
  are configurable via debugfs tunables.
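
The kmigrated pass described above can be sketched in userspace C as
follows (the batching callback, batch size and flat hot-map array are
assumptions for illustration; the real thread walks the per-mem_section
hot maps and calls the batch migration API):

```c
#include <stdint.h>
#include <stddef.h>

#define PGHOT_READY (1u << 7)   /* MSB of the default-mode u8 record */

/* One pass over a node's hotness records: collect PFNs whose ready
 * bit is set, clear the bit, and hand them off in batches. */
static size_t kmigrated_scan(uint8_t *hot_map, size_t npfns,
                             uint64_t *batch, size_t batch_max,
                             void (*migrate)(uint64_t *, size_t))
{
    size_t n = 0, total = 0;

    for (size_t pfn = 0; pfn < npfns; pfn++) {
        if (!(hot_map[pfn] & PGHOT_READY))
            continue;
        hot_map[pfn] &= ~PGHOT_READY;   /* consume the ready bit */
        batch[n++] = pfn;
        total++;
        if (n == batch_max) {           /* migrate a full batch */
            migrate(batch, n);
            n = 0;
        }
    }
    if (n)
        migrate(batch, n);              /* migrate the tail */
    return total;
}
```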

Memory overhead
---------------
Default mode: 1 byte per lower-tier PFN. For 1TB of lower-tier memory,
this amounts to 256MB of overhead (assuming 4K pages).

Precision mode: 4 bytes per lower-tier PFN. For 1TB of lower-tier memory,
this amounts to 1GB of overhead.
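
The stated figures follow directly from the record sizes; a quick
sanity check (a sketch, not kernel code):

```c
#include <stdint.h>

/* Per-PFN hotness records: overhead = (memory / page size) * record size.
 * 1TB / 4K = 256M pages, so 1 byte per record costs 256MB and 4 bytes
 * cost 1GB. */
static uint64_t pghot_overhead(uint64_t mem_bytes, uint64_t page_size,
                               uint64_t record_bytes)
{
    return mem_bytes / page_size * record_bytes;
}
```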

Bit layout of hotness record
----------------------------
Default mode
- Bits 0-1: Frequency (2 bits, 4 access samples)
- Bits 2-6: Bucketed time (5 bits, up to 4s with HZ=1000)
- Bit 7: Migration ready bit

Precision mode
- Bits 0-9: Target NID (10 bits)
- Bits 10-12: Frequency (3 bits, 8 access samples)
- Bits 13-26: Time (14 bits, up to 16s with HZ=1000)
- Bits 27-30: Reserved
- Bit 31: Migration ready bit
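
The layouts above can be captured with the usual masks and shifts; a
self-contained sketch (the helper names are illustrative, not the
kernel's actual accessors, and the 128ms bucket granularity is an
assumption consistent with the ~3.9s maximum mentioned in the changelog):

```c
#include <stdint.h>

/* Default mode (u8): freq[1:0] | bucketed time[6:2] | ready[7] */
#define DEF_FREQ_MASK   0x03u
#define DEF_TIME_SHIFT  2
#define DEF_TIME_MASK   0x1Fu
#define DEF_READY       (1u << 7)

static uint8_t def_pack(unsigned freq, unsigned time, int ready)
{
    return (freq & DEF_FREQ_MASK) |
           ((time & DEF_TIME_MASK) << DEF_TIME_SHIFT) |
           (ready ? DEF_READY : 0);
}

/* Assumed bucketing: 128ms buckets, so 5 bits cover 31 * 128 = 3968ms,
 * i.e. the ~3.9s maximum of the nominal 4s window. */
static unsigned def_time_bucket(unsigned ms)
{
    return (ms >> 7) & DEF_TIME_MASK;
}

/* Precision mode (u32):
 * nid[9:0] | freq[12:10] | time[26:13] | reserved[30:27] | ready[31] */
#define PREC_NID_MASK   0x3FFu
#define PREC_FREQ_SHIFT 10
#define PREC_FREQ_MASK  0x7u
#define PREC_TIME_SHIFT 13
#define PREC_TIME_MASK  0x3FFFu
#define PREC_READY      (1u << 31)

static uint32_t prec_pack(unsigned nid, unsigned freq, unsigned time,
                          int ready)
{
    return (nid & PREC_NID_MASK) |
           ((freq & PREC_FREQ_MASK) << PREC_FREQ_SHIFT) |
           ((uint32_t)(time & PREC_TIME_MASK) << PREC_TIME_SHIFT) |
           (ready ? PREC_READY : 0);
}
```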

Potential hotness sources
-------------------------
1. NUMA Balancing (NUMAB2, Tiering mode)
2. IBS - Instruction Based Sampling, hardware based sampling
   mechanism present on AMD CPUs.
3. klruscand - PTE‑A bit scanning built on MGLRU’s walk helpers.
4. folio_mark_accessed() - Page cache access tracking (unmapped
   page cache pages)

Changes in v6
=============
- While earlier versions included sample implementations for all
  the hotness sources listed above, only NUMAB2 is retained in
  this iteration; hardened versions of the others will be included
  in subsequent iterations.
- Cleaned up the NUMAB2 implementation by removing unused code
  (such as the access time tracking code).
- Ensured that NUMAB1 mode works as before. (Earlier versions made
  the NUMA hint fault handler work only in NUMAB2 mode.)
- NUMA Balancing tiering mode is moved to its own new config
  CONFIG_NUMA_BALANCING_TIERING to make code sharing between
  NUMA Balancing and pghot easier.
- A number of hot page promotion related stats now depend on
  CONFIG_PGHOT, as the promotion engine is part of pghot.
- Fixed kmigrated to take a reference on the folio when walking
  the PFNs checking for migrate-ready folios.
- Fixed speculative folio access issue reported by Chris Mason's
  review-prompts.
- Added per-memcg NUMA_PAGE_MIGRATE stats accounting for batch
  migration API too.
- Added support for initializing hot_maps in the newly added
  sections during memory hotplug. 
- Default hotness threshold window changed from 4s to 3s, as the
  maximum time representable in default mode is only 3.9s.
- Lots of cleanups and code restructuring.

Results
=======
Posted as replies to this mail thread.

This v6 patchset applies on top of upstream commit a989fde763f4f and
can be fetched from:

https://github.com/AMDESE/linux-mm/tree/bharata/pghot-rfcv6

v5: https://lore.kernel.org/linux-mm/20260129144043.231636-1-bharata@amd.com/
v4: https://lore.kernel.org/linux-mm/20251206101423.5004-1-bharata@amd.com/
v3: https://lore.kernel.org/linux-mm/20251110052343.208768-1-bharata@amd.com/
v2: https://lore.kernel.org/linux-mm/20250910144653.212066-1-bharata@amd.com/
v1: https://lore.kernel.org/linux-mm/20250814134826.154003-1-bharata@amd.com/
v0: https://lore.kernel.org/linux-mm/20250306054532.221138-1-bharata@amd.com/

Bharata B Rao (4):
  mm: migrate: Allow misplaced migration without VMA
  mm: Hot page tracking and promotion - pghot
  mm: pghot: Precision mode for pghot
  mm: sched: move NUMA balancing tiering promotion to pghot

Gregory Price (1):
  mm: migrate: Add migrate_misplaced_folios_batch()

 Documentation/admin-guide/mm/pghot.txt |  80 ++++
 include/linux/migrate.h                |  10 +-
 include/linux/mm.h                     |  35 +-
 include/linux/mmzone.h                 |  24 +-
 include/linux/pghot.h                  | 113 +++++
 include/linux/vm_event_item.h          |   5 +
 init/Kconfig                           |  13 +
 kernel/sched/core.c                    |   7 +
 kernel/sched/debug.c                   |   1 -
 kernel/sched/fair.c                    | 177 +------
 kernel/sched/sched.h                   |   1 -
 mm/Kconfig                             |  25 +
 mm/Makefile                            |   6 +
 mm/huge_memory.c                       |  27 +-
 mm/memcontrol.c                        |   6 +-
 mm/memory-tiers.c                      |  15 +-
 mm/memory.c                            |  36 +-
 mm/mempolicy.c                         |   3 -
 mm/migrate.c                           |  88 +++-
 mm/mm_init.c                           |  10 +
 mm/pghot-default.c                     |  79 ++++
 mm/pghot-precise.c                     |  81 ++++
 mm/pghot-tunables.c                    | 182 ++++++++
 mm/pghot.c                             | 618 +++++++++++++++++++++++++
 mm/vmstat.c                            |   7 +-
 25 files changed, 1411 insertions(+), 238 deletions(-)
 create mode 100644 Documentation/admin-guide/mm/pghot.txt
 create mode 100644 include/linux/pghot.h
 create mode 100644 mm/pghot-default.c
 create mode 100644 mm/pghot-precise.c
 create mode 100644 mm/pghot-tunables.c
 create mode 100644 mm/pghot.c

base-commit: a989fde763f4f24209e4702f50a45be572340e68
-- 
2.34.1

Re: [RFC PATCH v6 0/5] mm: Hot page tracking and promotion infrastructure
Posted by Bharata B Rao 1 week, 3 days ago
NAS Parallel Benchmarks - BT results

Test system details
-------------------
3 node AMD Zen5 system with 2 regular NUMA nodes (0, 1) and a CXL node (2)

$ numactl -H
available: 3 nodes (0-2)
node 0 cpus: 0-95,192-287
node 0 size: 128460 MB
node 1 cpus: 96-191,288-383
node 1 size: 128893 MB
node 2 cpus:
node 2 size: 257993 MB
node distances:
node   0   1   2
  0:  10  32  50
  1:  32  10  60
  2:  255  255  10

Hotness sources
---------------
NUMAB0 - Without NUMA Balancing in base case and with no source enabled
         in the pghot case. No migrations occur.
NUMAB2 - Existing hot page promotion for the base case and
         use of hint faults as source in the pghot case.
         Both promotion and demotion are enabled in this case.
NUMAB3 - Enabled both regular and tiering mode of NUMA Balancing
         (kernel.numa_balancing=3)

Pghot by default promotes after two accesses but for NUMAB2 source,
promotion is done after one access to match the base behaviour.
(/sys/kernel/debug/pghot/freq_threshold=1)

NAS-BT details
--------------
Command: mpirun -np 16 /usr/bin/numactl --cpunodebind=0,1
NPB3.4.4/NPB3.4-MPI/bin/bt.F.x

While class D uses around 24GB of memory (too little to show the benefit
of promotion), class E needs around 368GB, which overflows my toptier.
I wanted something in between these two, so I modified class F to a
problem size of 768, which results in around 160GB of memory.

After the memory consumption stabilizes, all the rank PIDs are paused and
their memory is moved to the CXL node using the migratepages command. This
simulates memory residing on a lower-tier node, with accesses by the BT
processes leading to promotion.
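
One way this step could be scripted, using migratepages(8) from the
numactl package (the PID list variable is hypothetical; node numbers are
those of this test system):

```shell
# Pause each rank, move its memory from the DRAM nodes (0,1) to the
# CXL node (2), then resume it.
for pid in $RANK_PIDS; do
    kill -STOP "$pid"
    migratepages "$pid" 0,1 2
    kill -CONT "$pid"
done
```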

Time in seconds - Lower is better
Mop/s total - Higher is better
=====================================================================================
                        Base            Base            pghot-default   pghot-precise
                        NUMAB0          NUMAB2          NUMAB2          NUMAB2
=====================================================================================
Time in seconds         7321.79         4333.85         6498.78         4386.27
Mop/s total             53451.77        90303.780       60221.01        89224.51

pgpromote_success       0               41971151        423163051       41957809
pgpromote_candidate     0               0               1870949786      0
pgpromote_candidate_nrl 0               41971151        29360089        41957809
pgdemote_kswapd         0               0               391179763       0
numa_pte_updates        0               42041312        1919944389      2568923206
numa_hint_faults        0               41972330        1911683592      2562729196
=====================================================================================
                                                        pghot-default
                                                        NUMAB3
=====================================================================================
Time in seconds                                         4425.84
Mop/s total                                             88426.77

pgpromote_success                                       41957442
pgpromote_candidate                                     0
pgpromote_candidate_nrl                                 41957442
pgdemote_kswapd                                         0
numa_pte_updates                                        2588634775
numa_hint_faults                                        2581645889
=====================================================================================

- In the base case, the benchmark numbers improve significantly due to hot page
  promotion.
- Though the benchmark runs for hundreds of minutes, the pages get promoted
  within the first few minutes.
- pghot-precise is able to match the base case numbers.
- The benchmark suffers in pghot-default case due to promotion being limited
  to the default NID (0) only. This leads to excessive PTE updates, hint faults,
  demotion and promotion churn.
- With NUMAB3, the pghot-default case recovers the performance, as in this
  mode misplaced hot page migrations get placed correctly due to NUMA
  balancing mode=1 being active.
Re: [RFC PATCH v6 0/5] mm: Hot page tracking and promotion infrastructure
Posted by Bharata B Rao 1 week, 3 days ago
Graph500 results

Test system details
-------------------
3 node AMD Zen5 system with 2 regular NUMA nodes (0, 1) and a CXL node (2)

$ numactl -H
available: 3 nodes (0-2)
node 0 cpus: 0-95,192-287
node 0 size: 128460 MB
node 1 cpus: 96-191,288-383
node 1 size: 128893 MB
node 2 cpus:
node 2 size: 257993 MB
node distances:
node   0   1   2
  0:  10  32  50
  1:  32  10  60
  2:  255  255  10

Hotness sources
---------------
NUMAB0 - Without NUMA Balancing in base case and with no source enabled
         in the pghot case. No migrations occur.
NUMAB2 - Existing hot page promotion for the base case and
         use of hint faults as source in the pghot case.
NUMAB3 - Enabled both regular and tiering mode of NUMA Balancing
         (kernel.numa_balancing=3)

Pghot by default promotes after two accesses but for NUMAB2 source,
promotion is done after one access to match the base behaviour.
(/sys/kernel/debug/pghot/freq_threshold=1)

Graph500 details
----------------
Command: mpirun -n 128 --bind-to core --map-by core
graph500/src/graph500_reference_bfs 28 16

After the graph creation, the processes are stopped and data is migrated
to CXL node 2 before continuing so that BFS phase starts accessing lower
tier memory.

Total memory usage is slightly over 100GB and will fit within Node 0 and 1.
Hence there is no memory pressure to induce demotions.

harmonic_mean_TEPS - Higher is better
=====================================================================================
                        Base            Base            pghot-default   pghot-precise
                        NUMAB0          NUMAB2          NUMAB2          NUMAB2
=====================================================================================
harmonic_mean_TEPS      5.07693e+08     7.08679e+08     5.56854e+08     7.39417e+08
mean_time               8.45968         6.06046         7.71283         5.80853
median_TEPS             5.08914e+08     7.23181e+08     5.51614e+08     7.58993e+08
max_TEPS                5.15226e+08     1.01654e+09     7.75233e+08     9.69136e+08

pgpromote_success       0               13797978        13746431        13752523
numa_pte_updates        0               26727341        39998363        48374479
numa_hint_faults        0               13798301        24459996        32728927
=====================================================================================
                                                        pghot-default
                                                        NUMAB3
=====================================================================================
harmonic_mean_TEPS                                      7.18678e+08
mean_time                                               5.97614
median_TEPS                                             7.376e+08
max_TEPS                                                7.47337e+08

pgpromote_success                                       13821625
numa_pte_updates                                        93534398
numa_hint_faults                                        69164048
=====================================================================================
- The base case shows a good improvement with NUMAB2 in harmonic_mean_TEPS.
- The same improvement gets maintained with pghot-precise too.
- pghot-default mode doesn't show a benefit even though it achieves similar
  page promotion numbers. This mode doesn't track the accessing NID and by
  default promotes to NID=0, which probably isn't all that beneficial since
  processes are running on both Node 0 and Node 1.
- pghot-default recovers the performance when balancing between toptier nodes
  0 and 1 is enabled in addition to hot page promotion.
Re: [RFC PATCH v6 0/5] mm: Hot page tracking and promotion infrastructure
Posted by Bharata B Rao 1 week, 3 days ago
Redis-memtier results

Test system details
-------------------
3 node AMD Zen5 system with 2 regular NUMA nodes (0, 1) and a CXL node (2)

$ numactl -H
available: 3 nodes (0-2)
node 0 cpus: 0-95,192-287
node 0 size: 128460 MB
node 1 cpus: 96-191,288-383
node 1 size: 128893 MB
node 2 cpus:
node 2 size: 257993 MB
node distances:
node   0   1   2
  0:  10  32  50
  1:  32  10  60
  2:  255  255  10

Hotness sources
---------------
NUMAB0 - Without NUMA Balancing in base case and with no source enabled
         in the patched case. No migrations occur.
NUMAB2 - Existing hot page promotion for the base case and
         use of hint faults as source in the patched case.

Pghot by default promotes after two accesses but for NUMAB2 source,
promotion is done after one access to match the base behaviour.
(/sys/kernel/debug/pghot/freq_threshold=1)

==============================================================
Scenario 1 - Enough memory in toptier and hence only promotion
==============================================================
In the setup phase, 64GB database is provisioned and explicitly moved
to Node 2 by migrating redis-server's memory to Node 2.
Memtier is run on Node 1.

Parallel distribution, 50% of the keys accessed, each 4 times.
16        Threads
100       Connections per thread
77808     Requests per client

==================================================================================================
Type         Ops/sec    Avg. Latency     p50 Latency     p99 Latency     p99.9 Latency     KB/sec
--------------------------------------------------------------------------------------------------
Base, NUMAB0
Totals     226611.42       225.92873       224.25500       423.93500       454.65500       514886.68
--------------------------------------------------------------------------------------------------
Base, NUMAB2
Totals     257211.48       204.99755       216.06300       370.68700       454.65500       584413.47
--------------------------------------------------------------------------------------------------
pghot-default, NUMAB2
Totals     255631.78       209.20335       216.06300       378.87900       450.55900       580824.22
--------------------------------------------------------------------------------------------------
pghot-precise, NUMAB2
Totals     249494.46       209.31820       212.99100       380.92700       448.51100       566879.53
==================================================================================================

pgpromote_success
==================================
Base, NUMAB0            0
Base, NUMAB2            10,435,176
pghot-default, NUMAB2   10,435,235
pghot-precise, NUMAB2   10,435,294
==================================

- There is a clear benefit from hot page promotion. Both base and
  pghot show similar gains.
- The number of pages promoted is more or less the same in all
  cases.

==============================================================
Scenario 2 - Toptier memory overcommited, promotion + demotion
==============================================================
In the setup phase, a 192GB database is provisioned. The database occupies
Node 1 entirely (~128GB) and spills over to Node 2 (~64GB).
Memtier is run on Node 1.

Parallel distribution, 50% of the keys accessed, each 4 times.
16        Threads
100       Connections per thread
233424    Requests per client

==================================================================================================
Type         Ops/sec    Avg. Latency     p50 Latency     p99 Latency     p99.9 Latency     KB/sec
--------------------------------------------------------------------------------------------------
Base, NUMAB0
Totals     237743.40       217.72842       201.72700       395.26300       440.31900       540389.78
--------------------------------------------------------------------------------------------------
Base, NUMAB2
Totals     235935.72       219.36544       210.94300       411.64700       477.18300       536280.93
--------------------------------------------------------------------------------------------------
pghot-default, NUMAB2
Totals     248283.99       219.74875       211.96700       413.69500       509.95100       564348.49
--------------------------------------------------------------------------------------------------
pghot-precise, NUMAB2
Totals     240529.35       222.11878       215.03900       411.64700       464.89500       546722.22
==================================================================================================
                        pgpromote_success       pgdemote_kswapd
===============================================================
Base, NUMAB0            0                       672,591
Base, NUMAB2            350,632                 689,751
pghot-default, NUMAB2   17,118,987              17,421,474
pghot-precise, NUMAB2   24,030,292              24,342,569
===============================================================

- No clear benefit is seen with hot page promotion in either the base or
  the pghot case.
- Most promotion attempts in the base case fail because the NUMA hint fault
  latency exceeds the threshold value (default 1000ms) in the majority of
  the promotion attempts.
- Unlike base NUMAB2, where the hint fault latency is the difference between
  the PTE update time (during scanning) and the access time (hint fault),
  pghot uses a single latency threshold (3000ms in pghot-default and 5000ms
  in pghot-precise) for two purposes:
        1. If the time difference between successive accesses is within the
           threshold, the page is marked as hot.
        2. Later, when kmigrated picks up the page for migration, it migrates
           only if the difference between the current time and the time when
           the page was marked hot is within the threshold.
  Because of this difference in behaviour, more pages qualify for promotion
  compared to base NUMAB2.
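
The dual use of the single threshold can be sketched as follows (names are
illustrative and jiffies arithmetic is simplified to plain millisecond
integers):

```c
#include <stdint.h>
#include <stdbool.h>

/* One latency threshold serves two checks in pghot (3000ms in default
 * mode, 5000ms in precision mode). */

/* Check 1: at record time, an access counts toward hotness only if it
 * follows the previous access within the threshold. */
static bool pghot_is_hot(uint64_t prev_access_ms, uint64_t now_ms,
                         uint64_t threshold_ms)
{
    return now_ms - prev_access_ms <= threshold_ms;
}

/* Check 2: at migration time, kmigrated skips pages whose ready mark
 * has gone stale, i.e. was set longer than the threshold ago. */
static bool kmigrated_should_migrate(uint64_t marked_hot_ms, uint64_t now_ms,
                                     uint64_t threshold_ms)
{
    return now_ms - marked_hot_ms <= threshold_ms;
}
```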
Re: [RFC PATCH v6 0/5] mm: Hot page tracking and promotion infrastructure
Posted by Bharata B Rao 1 week, 3 days ago
Microbenchmark results

Test system details
-------------------
3 node AMD Zen5 system with 2 regular NUMA nodes (0, 1) and a CXL node (2)

$ numactl -H
available: 3 nodes (0-2)
node 0 cpus: 0-95,192-287
node 0 size: 128460 MB
node 1 cpus: 96-191,288-383
node 1 size: 128893 MB
node 2 cpus:
node 2 size: 257993 MB
node distances:
node   0   1   2
  0:  10  32  50
  1:  32  10  60
  2:  255  255  10

Hotness sources
---------------
NUMAB0 - Without NUMA Balancing in base case and with no source enabled
         in the patched case. No migrations occur.
NUMAB2 - Existing hot page promotion for the base case and
         use of hint faults as source in the patched case.

Pghot by default promotes after two accesses but for NUMAB2 source,
promotion is done after one access to match the base behaviour.
(/sys/kernel/debug/pghot/freq_threshold=1)

==============================================================
Scenario 1 - Enough memory in toptier and hence only promotion
==============================================================
Multi-threaded application with 64 threads that access memory at 4K granularity
repetitively and randomly. The number of accesses per thread and the randomness
pattern for each thread are fixed beforehand. The accesses are divided into
stores and loads in the ratio of 50:50.

Benchmark threads run on Node 0, while memory is initially provisioned on
CXL node 2 before the accesses start.

Repetitive accesses result in lower-tier pages becoming hot, and kmigrated
detecting and migrating them. The benchmark score is the time taken to
finish the accesses in microseconds; the sooner it finishes, the better.
All the numbers shown below are averages of 3 runs.

Default mode - Time taken (microseconds, lower is better)
---------------------------------------------------------
Source          Base            Pghot
---------------------------------------------------------
NUMAB0          119,658,562     118,037,791
NUMAB2          104,205,571     102,705,330
---------------------------------------------------------

Default mode - Pages migrated (pgpromote_success)
---------------------------------------------------------
Source          Base            Pghot
---------------------------------------------------------
NUMAB0          0               0
NUMAB2          2097152         2097152
---------------------------------------------------------

Precision mode - Time taken (microseconds, lower is better)
-----------------------------------------------------------
Source          Base            Pghot
-----------------------------------------------------------
NUMAB0          119,658,562     115,173,151
NUMAB2          104,205,571     102,194,435
-----------------------------------------------------------

Precision mode - Pages migrated (pgpromote_success)
---------------------------------------------------
Source          Base            Pghot
---------------------------------------------------
NUMAB0          0               0
NUMAB2          2097152         2097152
---------------------------------------------------

Rate of migration (pgpromote_success)
-----------------------------------------
Time(s)         Base            Pghot
-----------------------------------------
0               0               0
28              0               0
32              262144          262144
36              524288          469012
40              786432          720896
44              1048576         983040
48              1310720         1245184
52              1572864         1507328
56              1835008         1769472
60              2097152         2031616
64              2097152         2097152
-----------------------------------------

==============================================================
Scenario 2 - Toptier memory overcommited, promotion + demotion
==============================================================
Single threaded application that allocates memory on both DRAM and CXL nodes
using mmap(MAP_POPULATE). Every 1G region of allocated memory on CXL node is
accessed at 4K granularity randomly and repetitively to build up the notion
of hotness in the 1GB region that is under access. This should drive promotion.
For promotion to work successfully, the DRAM memory that has been
provisioned (and is not being accessed) should be demoted first. There is
enough free memory in the CXL node for demotions.

In summary, this benchmark creates a memory pressure on DRAM node and does
CXL memory accesses to drive both demotion and promotion.

The number of accesses is fixed and hence, the quicker the accessed pages
get promoted to DRAM, the sooner the benchmark is expected to finish.
All the numbers shown below are averages of 3 runs.

DRAM-node                       = 1
CXL-node                        = 2
Initial DRAM alloc ratio        = 75%
Allocation-size                 = 171798691840
Initial DRAM Alloc-size         = 128849018880
Initial CXL Alloc-size          = 42949672960
Hot-region-size                 = 1073741824
Nr-regions                      = 160
Nr-regions DRAM                 = 120 (provisioned but not accessed)
Nr-hot-regions CXL              = 40
Access pattern                  = random
Access granularity              = 4096
Delay b/n accesses              = 0
Load/store ratio                = 50l50s
THP used                        = no
Nr accesses                     = 42949672960
Nr repetitions                  = 1024

Default mode - Time taken (microseconds, lower is better)
------------------------------------------------------
Source          Base            Pghot
------------------------------------------------------
NUMAB0          61,028,534      59,432,137
NUMAB2          63,070,998      61,375,763
------------------------------------------------------

Default mode - Pages migrated (pgpromote_success)
-------------------------------------------------
Source          Base            Pghot
-------------------------------------------------
NUMAB0          0               0
NUMAB2          26546           1070842 (High R2R variation in Base)
-------------------------------------------------

Precision mode - Time taken (microseconds, lower is better)
------------------------------------------------------
Source          Base            Pghot
------------------------------------------------------
NUMAB0          61,028,534      60,354,547
NUMAB2          63,070,998      60,199,147
------------------------------------------------------

Precision mode - Pages migrated (pgpromote_success)
---------------------------------------------------
Source          Base            Pghot
---------------------------------------------------
NUMAB0          0               0
NUMAB2          26546           1088621 (High R2R variation in Base)
---------------------------------------------------

- The base case itself doesn't show any improvement in benchmark numbers
  due to hot page promotion. The same pattern is seen in the pghot case
  with all the sources except hwhints. The benchmark itself may need
  tuning so that promotion helps.
- There is a high run-to-run variation in the number of pages promoted in
  the base case.
- Most promotion attempts in the base case fail because the NUMA hint
  fault latency exceeds the threshold value (default 1000ms) in the
  majority of the promotion attempts.
- Unlike base NUMAB2, where the hint fault latency is the difference between
  the PTE update time (during scanning) and the access time (hint fault),
  pghot uses a single latency threshold (3000ms in pghot-default and 5000ms
  in pghot-precise) for two purposes:
        1. If the time difference between successive accesses is within the
           threshold, the page is marked as hot.
        2. Later, when kmigrated picks up the page for migration, it migrates
           only if the difference between the current time and the time when
           the page was marked hot is within the threshold.