[RFC PATCH V0 0/10] mm: slowtier page promotion based on PTE A bit
Posted by Raghavendra K T 1 year, 2 months ago
Introduction:
=============
This patchset is an outcome of an ongoing collaboration between AMD and Meta.
Meta wanted to explore an alternative page promotion technique as they
observe high latency spikes in their workloads that access CXL memory.

In the current hot page promotion, all the activities, including process
address space scanning, NUMA hint fault handling and page migration, are
performed in the process context, i.e., the scanning overhead is borne by
applications.

This is an early RFC patch series for (slow tier) CXL page promotion.
The approach in this patchset addresses the issue above by adding PTE
Accessed bit scanning.

Scanning is done by a global kernel thread which periodically scans all
processes' address spaces and checks for accesses by reading the PTE A
bit. It then migrates/promotes the accessed pages to the toptier node
(node 0 in the current approach).

Thus, the approach moves the overhead of scanning, NUMA hint faults and
migration out of the process context.
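
For illustration, the core per-PTE check could look roughly like the
following (a minimal sketch, not the actual mm/kmmscand.c code;
kmmscand_check_pte() is a hypothetical name, and real code must also
isolate the folio from the LRU before queueing it):

/*
 * Illustrative helper, called with the PTE lock held from the
 * scanner's page table walk: if the Accessed bit is set, clear it
 * and queue the folio for promotion.
 */
static void kmmscand_check_pte(struct vm_area_struct *vma, unsigned long addr,
                               pte_t *pte, struct list_head *migrate_list)
{
        struct folio *folio;

        /* Test and clear the Accessed bit, flushing the TLB as needed. */
        if (!ptep_clear_flush_young(vma, addr, pte))
                return;

        folio = vm_normal_folio(vma, addr, ptep_get(pte));
        if (!folio)
                return;

        /* Only slow-tier (e.g., CXL) folios are promotion candidates. */
        if (node_is_toptier(folio_nid(folio)))
                return;

        /* Real code must isolate the folio from the LRU first. */
        list_add_tail(&folio->lru, migrate_list);
}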

Initial results show promising numbers on a microbenchmark.

Experiment:
============
Abench microbenchmark:
- Allocates 8GB/32GB of memory on the CXL node
- 64 threads are created, and each thread randomly accesses pages at 4K
  granularity (a rough sketch of this access pattern follows below).
- 512 iterations with a delay of 1 us between two successive iterations.
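
For reference, a rough userspace approximation of this access pattern
(illustrative only, not the actual abench source; assumes libnuma and that
the CXL node id is passed as argv[1]):

/* gcc -O2 -pthread abench_sketch.c -lnuma */
#include <numa.h>
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

#define NTHREADS   64
#define ITERATIONS 512
#define PAGE_SZ    4096UL

static char *buf;
static size_t bufsz = 8UL << 30;        /* 8GB case */

static void *worker(void *arg)
{
        unsigned int seed = (unsigned int)(long)arg;
        size_t npages = bufsz / PAGE_SZ;

        for (int it = 0; it < ITERATIONS; it++) {
                /* Touch random 4K pages, then pause for 1 us. */
                for (size_t i = 0; i < npages / NTHREADS; i++)
                        buf[(rand_r(&seed) % npages) * PAGE_SZ] = 1;
                usleep(1);
        }
        return NULL;
}

int main(int argc, char **argv)
{
        int cxl_node = argc > 1 ? atoi(argv[1]) : 1;
        pthread_t tids[NTHREADS];

        /* Back the buffer with memory on the (slow tier) CXL node. */
        buf = numa_alloc_onnode(bufsz, cxl_node);
        if (!buf) {
                perror("numa_alloc_onnode");
                return 1;
        }
        for (long i = 0; i < NTHREADS; i++)
                pthread_create(&tids[i], NULL, worker, (void *)i);
        for (int i = 0; i < NTHREADS; i++)
                pthread_join(tids[i], NULL);
        numa_free(buf, bufsz);
        return 0;
}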

SUT: 512 CPUs, 2 nodes, 256GB memory, AMD EPYC.

3 runs, command:  abench -m 2 -d 1 -i 512 -s <size>

The benchmark measures how much time is taken to complete the task; lower
is better. The expectation is that CXL node memory is migrated (promoted)
as fast as possible.

Base case:    6.11-rc6 w/ numab mode = 2 (hot page promotion enabled).
Patched case: 6.11-rc6 w/ numab mode = 0 (NUMA balancing disabled);
we expect the daemon to do the page promotion.
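
Here "numab mode" refers to the /proc/sys/kernel/numa_balancing sysctl
(0 = disabled, 1 = classic NUMA balancing, 2 = memory-tiering hot page
promotion), e.g.:

        echo 2 > /proc/sys/kernel/numa_balancing   # base case
        echo 0 > /proc/sys/kernel/numa_balancing   # patched case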

Result [*]:
===========
         base                    patched
         time in sec  (%stdev)   time in sec  (%stdev)     %gain
 8GB     133.66       ( 0.38 )        113.77  ( 1.83 )     14.88
32GB     584.77       ( 0.19 )        542.79  ( 0.11 )      7.17

[*] Please note that the current patchset applies to 6.13-rc, but these
results are older because the latest kernel has issues populating CXL node
memory. I will email findings/a fix for that soon.

Overhead:
The times below are calculated using patch 10. The actual overhead for the
patched case may be even lower.

               (scan + migration) time in sec
Total memory   base kernel    patched kernel     %gain
 8GB             65.74           13.93            78.81
32GB            153.95          132.12            14.18

Breakup for 8GB         base    patched
numa_task_work_oh       0.883   0
numa_hf_migration_oh   64.86    0
kmmscand_scan_oh        0       2.74
kmmscand_migration_oh   0      11.19

Breakup for 32GB        base    patched
numa_task_work_oh       4.79     0
numa_hf_migration_oh   149.16    0
kmmscand_scan_oh         0      23.4
kmmscand_migration_oh    0     108.72

Limitations:
===========
The PTE A bit scanning approach lacks information about the exact
destination node to migrate to.

Notes/observations on design/implementation/alternatives/TODOs
==============================================================
1. Fine-tuning scan throttling

2. Use migrate_balanced_pgdat() to balance the toptier node before
migration, OR use migrate_misplaced_folio_prepare() directly. But it may
need some optimizations (e.g., invoke it only occasionally so that the
overhead is not incurred on every migration).

3. Explore whether a separate PAGE_EXT flag is needed instead of reusing
the PAGE_IDLE flag (cons: complicates PTE A bit handling in the system),
but practically this does not look like a good idea.

4. Use timestamp-information-based migration (similar to numab mode=2)
instead of migrating immediately when the PTE A bit is set.
(cons:
 - It will not be accurate, since it is done outside of the process
context.
 - The performance benefit may be lost.)

5. Explore whether we need to use PFN information + a hash list instead of
a simple migration list. Here, scanning is done directly with PFNs
belonging to the CXL node (a sketch follows this list).

6. Holding PTE lock before migration.

7. Solve: how to find target toptier node for migration.

8. Use DAMON APIs, or reuse parts of DAMON, which already tracks ranges of
accessed physical addresses.

9. Gregory has nicely described some details/ideas on different approaches
in [1] (development notes), in the context of promoting unmapped page cache
folios.

10. SJ had pointed out concerns about kernel-thread-based approaches, as
with kstaled [2]. The current patchset has tried to address the issue with
simple algorithms to reduce CPU overhead. Migration throttling, running the
daemon at nice priority, and parallelizing migration with scanning could
help further.

11. Information from scanned toptier pages can be used to assist the
current NUMAB by identifying hot VMAs.
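
To make item 5 above concrete, one possibility is a PFN-keyed hash table
(a minimal sketch; kmmscand_pfn_info and friends are hypothetical names,
not from this patchset):

#include <linux/hashtable.h>
#include <linux/jiffies.h>

#define KMMSCAND_HASH_BITS 10
static DEFINE_HASHTABLE(kmmscand_pfn_hash, KMMSCAND_HASH_BITS);

/* Per-PFN access info accumulated across scan passes. */
struct kmmscand_pfn_info {
        unsigned long pfn;
        unsigned int accesses;          /* A-bit hits seen so far */
        unsigned long last_seen;        /* jiffies of last hit */
        struct hlist_node node;
};

static struct kmmscand_pfn_info *kmmscand_pfn_lookup(unsigned long pfn)
{
        struct kmmscand_pfn_info *info;

        hash_for_each_possible(kmmscand_pfn_hash, info, node, pfn)
                if (info->pfn == pfn)
                        return info;
        return NULL;
}

This would let the scanner migrate only PFNs that stay hot across passes,
instead of queueing every folio whose A bit was set once.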

Credits
=======
Thanks to Bharata, Johannes, Gregory, SJ, and Chris for their valuable
comments and support.

The kernel thread skeleton and some parts of the code are hugely inspired
by the khugepaged implementation and parts of the IBS patches from
Bharata [3].

Looking forward to your comments on whether the current approach in this
*early* RFC looks promising, or whether there are alternative ideas, etc.

Links:
[1] https://lore.kernel.org/lkml/20241127082201.1276-1-gourry@gourry.net/
[2] kstaled: https://lore.kernel.org/lkml/1317170947-17074-3-git-send-email-walken@google.com/#r
[3] https://lore.kernel.org/lkml/Y+Pj+9bbBbHpf6xM@hirez.programming.kicks-ass.net/

I might have unintentionally CCed more or fewer people than needed.

Raghavendra K T (10):
  mm: Add kmmscand kernel daemon
  mm: Maintain mm_struct list in the system
  mm: Scan the mm and create a migration list
  mm/migration: Migrate accessed folios to toptier node
  mm: Add throttling of mm scanning using scan_period
  mm: Add throttling of mm scanning using scan_size
  sysfs: Add sysfs support to tune scanning
  vmstat: Add vmstat counters
  trace/kmmscand: Add tracing of scanning and migration
  kmmscand: Add scanning

 fs/exec.c                     |    4 +
 include/linux/kmmscand.h      |   30 +
 include/linux/mm.h            |   14 +
 include/linux/mm_types.h      |    4 +
 include/linux/vm_event_item.h |   14 +
 include/trace/events/kmem.h   |   99 +++
 kernel/fork.c                 |    4 +
 kernel/sched/fair.c           |   13 +-
 mm/Kconfig                    |    7 +
 mm/Makefile                   |    1 +
 mm/huge_memory.c              |    1 +
 mm/kmmscand.c                 | 1144 +++++++++++++++++++++++++++++++++
 mm/memory.c                   |   12 +-
 mm/vmstat.c                   |   14 +
 14 files changed, 1352 insertions(+), 9 deletions(-)
 create mode 100644 include/linux/kmmscand.h
 create mode 100644 mm/kmmscand.c


base-commit: bcc8eda6d34934d80b96adb8dc4ff5dfc632a53a
-- 
2.39.3
Re: [RFC PATCH V0 0/10] mm: slowtier page promotion based on PTE A bit
Posted by SeongJae Park 1 year, 1 month ago
Hello Raghavendra,


Thank you for posting this nice patch series.  I gave you some feedback
offline.  Adding those here again for transparency in this public
discussion, for which I am grateful.

On Sun, 1 Dec 2024 15:38:08 +0000 Raghavendra K T <raghavendra.kt@amd.com> wrote:

> Introduction:
> =============
> This patchset is an outcome of an ongoing collaboration between AMD and Meta.
> Meta wanted to explore an alternative page promotion technique as they
> observe high latency spikes in their workloads that access CXL memory.
> 
> In the current hot page promotion, all the activities, including process
> address space scanning, NUMA hint fault handling and page migration, are
> performed in the process context, i.e., the scanning overhead is borne by
> applications.

Yet another approach is using DAMON.  DAMON does access monitoring, and
further allows users to request access pattern-driven system operations in
the name of DAMOS (Data Access Monitoring-based Operation Schemes).  Using
it, users can request DAMON to find hot pages and promote them, while
finding cold pages and demoting them.  SK hynix has built their CXL-based
memory capacity expansion solution that way
(https://github.com/skhynix/hmsdk/wiki/Capacity-Expansion).  We
collaboratively developed new DAMON features for that, and those are all
in the mainline since Linux v6.11.
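
For reference, with those features a physical-address-space promotion
scheme can be set up through DAMON's sysfs interface roughly as follows
(paths relative to /sys/kernel/mm/damon/admin/; access pattern and quota
tuning omitted, and the exact layout may differ across kernel versions):

        echo 1 > kdamonds/nr_kdamonds
        echo 1 > kdamonds/0/contexts/nr_contexts
        echo paddr > kdamonds/0/contexts/0/operations
        echo 1 > kdamonds/0/contexts/0/schemes/nr_schemes
        echo migrate_hot > kdamonds/0/contexts/0/schemes/0/action
        echo 0 > kdamonds/0/contexts/0/schemes/0/target_nid
        echo on > kdamonds/0/state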

I also proposed an idea for advancing it using DAMOS auto-tuning on a more
general (>2 tiers) setup
(https://lore.kernel.org/20231112195602.61525-1-sj@kernel.org).  I haven't
had time to further implement and test the idea so far, though.

> 
> This is an early RFC patch series for (slow tier) CXL page promotion.
> The approach in this patchset addresses the issue above by adding PTE
> Accessed bit scanning.
> 
> Scanning is done by a global kernel thread which periodically scans all
> processes' address spaces and checks for accesses by reading the PTE A
> bit. It then migrates/promotes the accessed pages to the toptier node
> (node 0 in the current approach).
> 
> Thus, the approach moves the overhead of scanning, NUMA hint faults and
> migration out of the process context.

DAMON also uses the PTE A bit as a major source of the access information.
And DAMON does both access monitoring and promotion/demotion in a global
kernel thread, namely kdamond.  Hence a DAMON-based approach would also
offload the overheads from the process context.  So I feel your approach
has a sort of similarity with the DAMON-based one, and we might have a
chance to avoid unnecessary duplication.

[...]
> 
> Limitations:
> ===========
> The PTE A bit scanning approach lacks information about the exact
> destination node to migrate to.

This is the same for the DAMON-based approach, since DAMON also uses the
PTE A bit as the major source of the information.  We aim to extend DAMON
to be aware of the access source CPU, and to use that for solving this
problem, though.  Utilizing page faults or AMD IBS-like h/w features is
among the ideas on the table.

> 
> Notes/observations on design/implementation/alternatives/TODOs
> ==============================================================
> 1. Fine-tuning scan throttling

DAMON allows users to set the upper limit of the monitoring overhead using
the max_nr_regions parameter, and then provides best-effort accuracy within
that limit.  We also have ongoing projects for making it more accurate and
easier to tune.
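
For example, through the sysfs interface (path relative to
/sys/kernel/mm/damon/admin/kdamonds/0/contexts/0/, assuming the layout of
recent kernels):

        echo 1000 > monitoring_attrs/nr_regions/max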

> 
> 2. Use migrate_balanced_pgdat() to balance the toptier node before
> migration, OR use migrate_misplaced_folio_prepare() directly. But it may
> need some optimizations (e.g., invoke it only occasionally so that the
> overhead is not incurred on every migration).
> 
> 3. Explore whether a separate PAGE_EXT flag is needed instead of reusing
> the PAGE_IDLE flag (cons: complicates PTE A bit handling in the system),
> but practically this does not look like a good idea.
> 
> 4. Use timestamp-information-based migration (similar to numab mode=2)
> instead of migrating immediately when the PTE A bit is set.
> (cons:
>  - It will not be accurate, since it is done outside of the process
> context.
>  - The performance benefit may be lost.)

DAMON provides time-aggregated monitoring results, and DAMOS provides
prioritization of pages based on their access temperature.  Hence, a
DAMON-based approach can also be used for a similar purpose (promoting not
every accessed page, but pages that are accessed more frequently and for a
longer time).

> 
> 5. Explore whether we need to use PFN information + a hash list instead of
> a simple migration list. Here, scanning is done directly with PFNs
> belonging to the CXL node (a sketch follows this list).

DAMON supports physical address space monitoring, and maintains the access
monitoring results in its own data structure called damon_region.  So I
think a similar benefit can be achieved using DAMON?
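
Roughly, the region abstraction looks like this (simplified from
include/linux/damon.h, with fields trimmed for illustration):

        struct damon_addr_range {
                unsigned long start;
                unsigned long end;
        };

        struct damon_region {
                struct damon_addr_range ar;     /* monitored address range */
                unsigned int nr_accesses;       /* observed access frequency */
                unsigned int age;               /* how long the pattern held */
                struct list_head list;
        };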

[...]
> 8. Use DAMON APIs, or reuse parts of DAMON, which already tracks ranges of
> accessed physical addresses.

My biased, humble opinion is that it would be very nice to explore this
opportunity, since I see some similarities and opportunities to solve some
of the challenges of your approach in an easier way.  Even if it turns out
that DAMON cannot be used for your use case, failing earlier is a good
thing, I'd say :)

> 
> 9. Gregory has nicely described some details/ideas on different approaches
> in [1] (development notes), in the context of promoting unmapped page cache
> folios.

DAMON supports monitoring accesses to unmapped page cache folios, so hopefully
DAMON-based approaches can also solve this issue.

> 
> 10. SJ had pointed out concerns about kernel-thread-based approaches, as
> with kstaled [2]. The current patchset has tried to address the issue with
> simple algorithms to reduce CPU overhead. Migration throttling, running the
> daemon at nice priority, and parallelizing migration with scanning could
> help further.
> 
> 11. Information from scanned toptier pages can be used to assist the
> current NUMAB by identifying hot VMAs.
> 
> Credits
> =======
> Thanks to Bharata, Johannes, Gregory, SJ, and Chris for their valuable
> comments and support.

I also learned many things from the great discussions, thank you :)

[...]
> 
> Links:
> [1] https://lore.kernel.org/lkml/20241127082201.1276-1-gourry@gourry.net/
> [2] kstaled: https://lore.kernel.org/lkml/1317170947-17074-3-git-send-email-walken@google.com/#r
> [3] https://lore.kernel.org/lkml/Y+Pj+9bbBbHpf6xM@hirez.programming.kicks-ass.net/
> 
> I might have unintentionally CCed more or fewer people than needed.


Thanks,
SJ

[...]
Re: [RFC PATCH V0 0/10] mm: slowtier page promotion based on PTE A bit
Posted by Raghavendra K T 1 year, 1 month ago

On 12/11/2024 12:23 AM, SeongJae Park wrote:
> Hello Raghavendra,
> 
> Thank you for posting this nice patch series.  I gave you some feedback
> offline.  Adding those here again for transparency in this public
> discussion, for which I am grateful.
[...]

Hello SJ,

Thank you for the detailed explanation again. (Sorry for the late
acknowledgement; I was looking forward to the MM alignment discussion when
this message came.)

I think once the direction is fixed, we could surely use/reuse a lot of
source code from DAMON and MGLRU. The amazing design of DAMON should
surely help. I will keep in mind all the points raised here.

Thanks and Regards
- Raghu
Re: [RFC PATCH V0 0/10] mm: slowtier page promotion based on PTE A bit
Posted by Davidlohr Bueso 11 months, 4 weeks ago
On Sun, 01 Dec 2024, Raghavendra K T wrote:

>6. Holding PTE lock before migration.

fyi I tried testing this series with 'perf-bench numa mem' and got a soft lockup,
unable to take the PTL (and lost the machine to debug further atm), ie:

[ 3852.217675] CPU: 127 UID: 0 PID: 12537 Comm: watch-numa-sche Tainted: G      D      L     6.14.0-rc2-kmmscand-v1+ #3
[ 3852.217677] Tainted: [D]=DIE, [L]=SOFTLOCKUP
[ 3852.217678] RIP: 0010:native_queued_spin_lock_slowpath+0x64/0x290
[ 3852.217683] Code: 77 7b f0 0f ba 2b 08 0f 92 c2 8b 03 0f b6 d2 c1 e2 08 30 e4 09 d0 3d ff 00 00 00 77 57 85 c0 74 10 0f b6 03 84 c0 74 09 f3 90 <0f> b6 03 84 c0 75 f7 b8 01 00 00 00 66 89 03 5b 5d 41 5c 41 5d c3
[ 3852.217684] RSP: 0018:ff274259b3c9f988 EFLAGS: 00000202
[ 3852.217685] RAX: 0000000000000001 RBX: ffbd2efd8c08c9a8 RCX: 000ffffffffff000
[ 3852.217686] RDX: 0000000000000000 RSI: 0000000000000001 RDI: ffbd2efd8c08c9a8
[ 3852.217687] RBP: ff161328422c1328 R08: ff274259b3c9fb90 R09: ff161328422c1000
[ 3852.217688] R10: 00000000ffffffff R11: 0000000000000004 R12: 00007f52cca00000
[ 3852.217688] R13: ff274259b3c9fa00 R14: ff16132842326000 R15: ff161328422c1328
[ 3852.217689] FS:  00007f32b6f92b80(0000) GS:ff161423bfd80000(0000) knlGS:0000000000000000
[ 3852.217691] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 3852.217692] CR2: 0000564ddbf68008 CR3: 00000080a81cc005 CR4: 0000000000773ef0
[ 3852.217693] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 3852.217694] DR3: 0000000000000000 DR6: 00000000fffe07f0 DR7: 0000000000000400
[ 3852.217694] PKRU: 55555554
[ 3852.217695] Call Trace:
[ 3852.217696]  <IRQ>
[ 3852.217697]  ? watchdog_timer_fn+0x21b/0x2a0
[ 3852.217699]  ? __pfx_watchdog_timer_fn+0x10/0x10
[ 3852.217702]  ? __hrtimer_run_queues+0x10f/0x2a0
[ 3852.217704]  ? hrtimer_interrupt+0xfb/0x240
[ 3852.217706]  ? __sysvec_apic_timer_interrupt+0x4e/0x110
[ 3852.217709]  ? sysvec_apic_timer_interrupt+0x68/0x90
[ 3852.217712]  </IRQ>
[ 3852.217712]  <TASK>
[ 3852.217713]  ? asm_sysvec_apic_timer_interrupt+0x16/0x20
[ 3852.217717]  ? native_queued_spin_lock_slowpath+0x64/0x290
[ 3852.217720]  _raw_spin_lock+0x25/0x30
[ 3852.217723]  __pte_offset_map_lock+0x9a/0x110
[ 3852.217726]  gather_pte_stats+0x1e3/0x2c0
[ 3852.217730]  walk_pgd_range+0x528/0xbb0
[ 3852.217733]  __walk_page_range+0x71/0x1d0
[ 3852.217736]  walk_page_vma+0x98/0xf0
[ 3852.217738]  show_numa_map+0x11a/0x3a0
[ 3852.217741]  seq_read_iter+0x2a6/0x470
[ 3852.217745]  seq_read+0x12b/0x170
[ 3852.217748]  vfs_read+0xe0/0x370
[ 3852.217751]  ? syscall_exit_to_user_mode+0x49/0x210
[ 3852.217755]  ? do_syscall_64+0x8a/0x190
[ 3852.217758]  ksys_read+0x6a/0xe0
[ 3852.217762]  do_syscall_64+0x7e/0x190
[ 3852.217765]  ? __memcg_slab_free_hook+0xd4/0x120
[ 3852.217768]  ? __x64_sys_close+0x38/0x80
[ 3852.217771]  ? kmem_cache_free+0x3bf/0x3e0
[ 3852.217774]  ? syscall_exit_to_user_mode+0x49/0x210
[ 3852.217777]  ? do_syscall_64+0x8a/0x190
[ 3852.217780]  ? do_syscall_64+0x8a/0x190
[ 3852.217783]  ? __irq_exit_rcu+0x3e/0xe0
[ 3852.217785]  entry_SYSCALL_64_after_hwframe+0x76/0x7e
Re: [RFC PATCH V0 0/10] mm: slowtier page promotion based on PTE A bit
Posted by Raghavendra K T 11 months, 4 weeks ago

On 2/12/2025 10:32 PM, Davidlohr Bueso wrote:
> On Sun, 01 Dec 2024, Raghavendra K T wrote:
> 
>> 6. Holding PTE lock before migration.
> 
> fyi I tried testing this series with 'perf-bench numa mem' and got a
> soft lockup, unable to take the PTL (and lost the machine to debug
> further atm), ie:
[...]


Hello David,

Thanks for reporting the details. Reproducer information helps me
stabilize the code quickly. The micro-benchmark I used did not show any
issues. I will add the PTE lock (PTL) handling and also check the issue
from my side (a sketch of the intended locking pattern follows below).

(With multiple scanning threads, it could cause even more issues because
of higher migration pressure; I am wondering if I should go ahead with a
more stabilized single-thread scanning version in the coming post.)
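
For reference, the locking pattern I intend to follow is roughly (a
minimal sketch, not the actual fix):

        pte_t *pte;
        spinlock_t *ptl;

        /* Map the PTE page and take its lock before touching A bits. */
        pte = pte_offset_map_lock(mm, pmd, addr, &ptl);
        if (!pte)
                return;         /* raced with zap/collapse; retry later */

        /* ... read/clear PTE A bits, collect candidate folios ... */

        pte_unmap_unlock(pte, ptl);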

Thanks and Regards
- Raghu