[PATCH v17 0/3] Improve proc RSS accuracy

Mathieu Desnoyers posted 3 patches 1 month, 2 weeks ago
There is a newer version of this series
[PATCH v17 0/3] Improve proc RSS accuracy
Posted by Mathieu Desnoyers 1 month, 2 weeks ago
This series introduces the hierarchical tree counter (hpcc) to increase
accuracy of approximated RSS counters exposed through proc interfaces.

With a test program hopping across CPUs doing frequent mmap/munmap
operations, the upstream implementation's approximation reaches a 1GB
delta from the precise value after a few minutes, compared to an 80MB
delta with the hierarchical counter. The hierarchical counter provides a
guaranteed maximum approximation inaccuracy of 192MB on that hardware
topology.

This series is based on
commit 0f2acd3148e0 Merge tag 'm68knommu-for-v7.0' of git://git.kernel.org/pub/scm/linux/kernel/git/gerg/m68knommu

The main changes since v16:
- Drop the OOM killer 2-pass task selection algorithm.
- Introduce KUnit tests.
- Only perform atomic increments of intermediate tree nodes when
  bits which are significant for carry propagation are being changed.
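
To illustrate the carry-propagation idea in the last item, here is a minimal
userspace sketch, not the code from this series. It assumes a two-level tree
(per-CPU leaves plus one root) and a power-of-two batch size; every name
prefixed with sketch_/SKETCH_ is hypothetical. The shared root is only touched
when the bits at or above the batch size change in a leaf.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

#define SKETCH_NR_CPUS 4          /* assumption: tiny fixed topology */
#define SKETCH_BATCH   32u        /* assumption: must be a power of two */
#define SKETCH_MASK    ((uint64_t)SKETCH_BATCH - 1)

struct sketch_counter {
    uint64_t leaf[SKETCH_NR_CPUS];  /* stand-in for per-CPU data */
    _Atomic uint64_t root;          /* shared node, updated rarely */
};

/* Add inc (may be negative) to the leaf owned by cpu. */
static void sketch_add(struct sketch_counter *c, unsigned int cpu, int64_t inc)
{
    uint64_t old, cur, carry;

    assert(cpu < SKETCH_NR_CPUS);
    old = c->leaf[cpu];
    cur = old + (uint64_t)inc;      /* unsigned wrap-around is well defined */
    c->leaf[cpu] = cur;

    /* Only the bits at or above the batch size are carried to the root. */
    carry = (cur & ~SKETCH_MASK) - (old & ~SKETCH_MASK);
    if (carry)
        atomic_fetch_add(&c->root, carry);
}

/* Cheap approximation: read the root only. */
static int64_t sketch_approx(struct sketch_counter *c)
{
    /* Conversion assumes a two's complement target. */
    return (int64_t)atomic_load(&c->root);
}

/* Precise value: root plus the low bits still held in each leaf. */
static int64_t sketch_precise(struct sketch_counter *c)
{
    uint64_t sum = atomic_load(&c->root);
    unsigned int cpu;

    for (cpu = 0; cpu < SKETCH_NR_CPUS; cpu++)
        sum += c->leaf[cpu] & SKETCH_MASK;
    return (int64_t)sum;
}
```

In this sketch the approximation never differs from the precise sum by more
than SKETCH_NR_CPUS * (SKETCH_BATCH - 1), which is the kind of bounded
inaccuracy the cover letter refers to.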

Andrew, this is meant to target 7.1 after the 7.0 merge window closes.

Thanks,

Mathieu

Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: "Paul E. McKenney" <paulmck@kernel.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Dennis Zhou <dennis@kernel.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: Christoph Lameter <cl@linux.com>
Cc: Martin Liu <liumartin@google.com>
Cc: David Rientjes <rientjes@google.com>
Cc: christian.koenig@amd.com
Cc: Shakeel Butt <shakeel.butt@linux.dev>
Cc: SeongJae Park <sj@kernel.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Sweet Tea Dorminy <sweettea-kernel@dorminy.me>
Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Cc: "Liam R . Howlett" <liam.howlett@oracle.com>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Christian Brauner <brauner@kernel.org>
Cc: Wei Yang <richard.weiyang@gmail.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Miaohe Lin <linmiaohe@huawei.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: linux-mm@kvack.org
Cc: linux-trace-kernel@vger.kernel.org
Cc: Yu Zhao <yuzhao@google.com>
Cc: Roman Gushchin <roman.gushchin@linux.dev>
Cc: Mateusz Guzik <mjguzik@gmail.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
Cc: Aboorva Devarajan <aboorvad@linux.ibm.com>

Mathieu Desnoyers (3):
  lib: Introduce hierarchical per-cpu counters
  lib: Test hierarchical per-cpu counters
  mm: Improve RSS counter approximation accuracy for proc interfaces

 .../core-api/percpu-counter-tree.rst          |  75 ++
 include/linux/mm.h                            |  19 +-
 include/linux/mm_types.h                      |  54 +-
 include/linux/percpu_counter_tree.h           | 367 ++++++++++
 include/trace/events/kmem.h                   |   2 +-
 init/main.c                                   |   2 +
 kernel/fork.c                                 |  22 +-
 lib/Kconfig                                   |  12 +
 lib/Makefile                                  |   1 +
 lib/percpu_counter_tree.c                     | 690 ++++++++++++++++++
 lib/tests/Makefile                            |   2 +
 lib/tests/percpu_counter_tree_kunit.c         | 351 +++++++++
 12 files changed, 1567 insertions(+), 30 deletions(-)
 create mode 100644 Documentation/core-api/percpu-counter-tree.rst
 create mode 100644 include/linux/percpu_counter_tree.h
 create mode 100644 lib/percpu_counter_tree.c
 create mode 100644 lib/tests/percpu_counter_tree_kunit.c

-- 
2.39.5
Re: [PATCH v17 0/3] Improve proc RSS accuracy
Posted by Heiko Carstens 1 month ago
On Tue, Feb 17, 2026 at 11:10:03AM -0500, Mathieu Desnoyers wrote:
> This series introduces the hierarchical tree counter (hpcc) to increase
> accuracy of approximated RSS counters exposed through proc interfaces.
> 
> With a test program hopping across CPUs doing frequent mmap/munmap
> operations, the upstream implementation approximation reaches a 1GB
> delta from the precise value after a few minutes, compared to a 80MB
> delta with the hierarchical counter. The hierarchical counter provides a
> guaranteed maximum approximation inaccuracy of 192MB on that hardware
> topology.
> 
> This series is based on
> commit 0f2acd3148e0 Merge tag 'm68knommu-for-v7.0' of git://git.kernel.org/pub/scm/linux/kernel/git/gerg/m68knommu
> 
> The main changes since v16:
> - Dropped OOM killer 2-pass task selection algorithm.
> - Introduce Kunit tests.
> - Only perform atomic increments of intermediate tree nodes when
>   bits which are significant for carry propagation are being changed.

This seems to cause crashes with linux-next on s390, at least I could bisect
it to the last patch of this series. Reverting the last one makes the crashes
go away:

0acac6604c1cfd7a1762901f0a4abe87cf3a8619 is the first bad commit
commit 0acac6604c1cfd7a1762901f0a4abe87cf3a8619 (HEAD)
Author:     Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
AuthorDate: Tue Feb 17 11:10:06 2026 -0500
Commit:     Andrew Morton <akpm@linux-foundation.org>
CommitDate: Tue Feb 24 11:15:15 2026 -0800

    mm: improve RSS counter approximation accuracy for proc interfaces

Unable to handle kernel pointer dereference in virtual kernel address space
Failing address: 766d615f72615000 TEID: 766d615f72615803 ESOP-2 FSI
Fault in home space mode while using kernel ASCE.
AS:000000025dc04007 R3:0000000000000024
Oops: 0038 ilc:2 [#1]SMP
Modules linked in:
CPU: 0 UID: 0 PID: 1 Comm: swapper/0 Not tainted 7.0.0-20260224.rc1.git266.3ef088b0c577.300.fc43.s390x+next #1 PREEMPTLAZY
Hardware name: IBM 3931 A01 703 (z/VM 7.4.0)
Krnl PSW : 0704c00180000000 00000216ef164cde (kernfs_name_hash+0x1e/0xb0)
           R:0 T:1 IO:1 EX:1 Key:0 M:1 W:0 P:0 AS:3 CC:0 PM:0 RI:0 EA:3
Krnl GPRS: 0000000000000000 0000000000000000 766d615f72615f65 0000000000000000
           766d615f72615f65 0000000000000000 0000000000000000 0000000000000000
           766d615f72615f65 0000000081212440 0000000000000000 0000000000000000
           0000000080a00000 00000216efcb5390 00000216ef16530c 00000196eeb07ae0
Krnl Code: 00000216ef164cd2: a7190000            lghi    %r1,0
           00000216ef164cd6: b9040042            lgr     %r4,%r2
          *00000216ef164cda: a7090000            lghi    %r0,0
          >00000216ef164cde: b25e0014            srst    %r1,%r4
           00000216ef164ce2: a714fffe            brc     1,00000216ef164cde
           00000216ef164ce6: b9e92051            sgrk    %r5,%r1,%r2
           00000216ef164cea: ec1200208076        crj     %r1,%r2,8,00000216ef164d2a
           00000216ef164cf0: b9160005            llgfr   %r0,%r5
Call Trace:
 [<00000216ef164cde>] kernfs_name_hash+0x1e/0xb0
 [<00000216ef167d32>] kernfs_remove_by_name_ns+0x72/0x120
 [<00000216ef16bbfa>] remove_files+0x4a/0x90
 [<00000216ef16bf96>] create_files+0x276/0x2b0
 [<00000216ef16c15a>] internal_create_group+0x18a/0x320
 [<00000216f09b61c6>] swap_init+0x5e/0xa0
 [<00000216eec7fb00>] do_one_initcall+0x40/0x270
 [<00000216f0990a40>] kernel_init_freeable+0x2b0/0x330
 [<00000216efb5160e>] kernel_init+0x2e/0x180
 [<00000216eec81ffc>] __ret_from_fork+0x3c/0x240
 [<00000216efb5e052>] ret_from_fork+0xa/0x30
Last Breaking-Event-Address:
 [<00000216ef165306>] kernfs_find_ns+0x76/0x140
Kernel panic - not syncing: Fatal exception: panic_on_oops
Re: [PATCH v17 0/3] Improve proc RSS accuracy
Posted by Andrew Morton 1 month ago
On Thu, 26 Feb 2026 13:04:22 +0100 Heiko Carstens <hca@linux.ibm.com> wrote:

> This seems to cause crashes with linux-next on s390, at least I could bisect
> it to the last patch of this series. Reverting the last one, makes the crashes
> go away:
> 
> 0acac6604c1cfd7a1762901f0a4abe87cf3a8619 is the first bad commit
> commit 0acac6604c1cfd7a1762901f0a4abe87cf3a8619 (HEAD)
> Author:     Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
> AuthorDate: Tue Feb 17 11:10:06 2026 -0500
> Commit:     Andrew Morton <akpm@linux-foundation.org>
> CommitDate: Tue Feb 24 11:15:15 2026 -0800
> 
>     mm: improve RSS counter approximation accuracy for proc interfaces

Thanks, I'll remove this series from linux-next for now.
Re: [PATCH v17 0/3] Improve proc RSS accuracy
Posted by Mathieu Desnoyers 1 month ago
On 2026-02-26 07:04, Heiko Carstens wrote:
> This seems to cause crashes with linux-next on s390, at least I could bisect
> it to the last patch of this series. Reverting the last one, makes the crashes
> go away:
> 
> 0acac6604c1cfd7a1762901f0a4abe87cf3a8619 is the first bad commit
> commit 0acac6604c1cfd7a1762901f0a4abe87cf3a8619 (HEAD)
> Author:     Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
> AuthorDate: Tue Feb 17 11:10:06 2026 -0500
> Commit:     Andrew Morton <akpm@linux-foundation.org>
> CommitDate: Tue Feb 24 11:15:15 2026 -0800
> 
>      mm: improve RSS counter approximation accuracy for proc interfaces

It looks like either an issue with ordering of the bootup sequence, or
an issue with the size of struct mm_struct init_mm. I'll have a look.

Thanks,

Mathieu

-- 
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com
Re: [PATCH v17 0/3] Improve proc RSS accuracy
Posted by Mathieu Desnoyers 1 month ago
On 2026-02-26 10:00, Mathieu Desnoyers wrote:
> It looks like either an issue with ordering of the bootup sequence, or
> an issue with the size of struct mm_struct init_mm. I'll have a look.

I've successfully booted a linux-next 7.0.0-rc1-next-20260226 within an
x86-64 VM, with a swap partition.

I wonder if s390x somehow alters the value of nr_cpu_ids late in
bootup, after percpu_counter_tree_subsystem_init()?

Can you share your .config and kernel command line arguments?

Thanks,

Mathieu

-- 
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com
Re: [PATCH v17 0/3] Improve proc RSS accuracy
Posted by Mathieu Desnoyers 1 month ago
On 2026-02-26 10:42, Mathieu Desnoyers wrote:
> I've successfully booted a linux-next 7.0.0-rc1-next-20260226 within an
> x86-64 VM, with a swap partition.
> 
> I wonder if s390x somehow alters the value of nr_cpu_ids late in
> bootup, after percpu_counter_tree_subsystem_init()?
> 
> Can you share your .config and kernel command line arguments?

I've successfully booted a defconfig s390x next-20260226 kernel in qemu
with 1 and 4 CPUs, and within a nested s390x VM on 2 CPUs.

I guess I'll really need more info about your specific .config and
command line args to help further.

Thanks,

Mathieu

-- 
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com
Re: [PATCH v17 0/3] Improve proc RSS accuracy
Posted by Nathan Chancellor 1 month ago
Hi Mathieu,

On Thu, Feb 26, 2026 at 02:38:04PM -0500, Mathieu Desnoyers wrote:
> I've successfully booted a defconfig s390x next-20260226 kernel in qemu
> with 1 and 4 CPUs, and within a nested s390x VM on 2 cpus.
> 
> I guess I'll really need more info about your specific .config and
> command line args to help further.

FWIW, the ClangBuiltLinux CI sees a boot failure with sparc64_defconfig,
which does not appear to be clang specific. I can reproduce it here
with:

  $ make -skj"$(nproc)" ARCH=sparc CROSS_COMPILE=sparc64-linux- mrproper sparc64_defconfig image

  $ curl -LSs https://github.com/ClangBuiltLinux/boot-utils/releases/download/20241120-044434/sparc64-rootfs.cpio.zst | zstd -d >rootfs.cpio

  $ qemu-system-sparc64 \
      -display none \
      -nodefaults \
      -M sun4u \
      -cpu 'TI UltraSparc IIi' \
      -append console=ttyS0 \
      -kernel arch/sparc/boot/image \
      -initrd rootfs.cpio \
      -m 1G \
      -serial mon:stdio
  ...
  [    0.001502] Linux version 7.0.0-rc1+ (nathan@framework-amd-ryzen-maxplus-395) (sparc64-linux-gcc (GCC) 15.2.0, GNU ld (GNU Binutils) 2.45) #1 SMP Thu Feb 26 18:00:08 MST 2026
  ...
  [    1.339282] Run /init as init process
  [    1.340037] Unable to handle kernel NULL pointer dereference
  [    1.340515] tsk->{mm,active_mm}->context = 0000000000000000
  [    1.340684] tsk->{mm,active_mm}->pgd = fffff80000402000
  [    1.340838]               \|/ ____ \|/
  [    1.340838]               "@'/ .. \`@"
  [    1.340838]               /_| \__/ |_\
  [    1.340838]                  \__U_/
  [    1.341260] swapper/0(1): Oops [#1]
  [    1.341575] CPU: 0 UID: 0 PID: 1 Comm: swapper/0 Not tainted 7.0.0-rc1+ #1 VOLUNTARY
  [    1.341937] TSTATE: 0000004411001606 TPC: 0000000000465674 TNPC: 0000000000465678 Y: 00000021    Not tainted
  [    1.342199] TPC: <init_new_context+0x14/0xc0>
  [    1.342584] g0: 0000000000000000 g1: 0000000000000000 g2: 0000000000000000 g3: 0000000000000000
  [    1.342815] g4: fffff80004170000 g5: fffff8003e1d2000 g6: fffff80004134000 g7: fffff8003f814598
  [    1.343047] o0: fffff8000479c0a0 o1: 0000000000000000 o2: 0000000000002000 o3: 0000000000000000
  [    1.343276] o4: fffff80004621200 o5: fffff80004006e00 sp: fffff80004137331 ret_pc: 000000000062a520
  [    1.343513] RPC: <kmem_cache_alloc_noprof+0x1c0/0x560>
  [    1.343681] l0: ffffffffffffffff l1: 0000000000000000 l2: 0000000000000001 l3: 0000000000000000
  [    1.343917] l4: 000000004fd6805e l5: 0000000001503c00 l6: 0000000001423800 l7: 0000000000000012
  [    1.344144] i0: fffff80004170000 i1: fffff8000479c0a0 i2: 0000000000472030 i3: 000000000180ac00
  [    1.344371] i4: 0000000000000000 i5: fffff8000400b500 i6: fffff800041373e1 i7: 0000000000472044
  [    1.344601] I7: <mm_init.isra.0+0x144/0x1e0>
  [    1.344751] Call Trace:
  [    1.344871] [<0000000000472044>] mm_init.isra.0+0x144/0x1e0
  [    1.345054] [<00000000006602ec>] alloc_bprm+0xcc/0x1e0
  [    1.345195] [<0000000000660e6c>] kernel_execve+0x2c/0x1c0
  [    1.345345] [<0000000000be4060>] kernel_init+0x70/0x128
  [    1.345496] [<00000000004060f0>] ret_from_fork+0x24/0x34
  [    1.345652] [<0000000000000000>] 0x0
  [    1.345823] Disabling lock debugging due to kernel taint
  [    1.346046] Caller[0000000000472044]: mm_init.isra.0+0x144/0x1e0
  [    1.346229] Caller[00000000006602ec]: alloc_bprm+0xcc/0x1e0
  [    1.346388] Caller[0000000000660e6c]: kernel_execve+0x2c/0x1c0
  [    1.346553] Caller[0000000000be4060]: kernel_init+0x70/0x128
  [    1.346707] Caller[00000000004060f0]: ret_from_fork+0x24/0x34
  [    1.346864] Caller[0000000000000000]: 0x0
  [    1.346985] Instruction DUMP:
  [    1.347007]  92102000
  [    1.347135]  90100019
  [    1.347204]  f85e6490
  [    1.347281] <c6588000>
  [    1.347349]  c25e6388
  [    1.347416]  fa5e6488
  [    1.347485]  82004003
  [    1.347552]  c65e63c0
  [    1.347619]  83784c00
  [    1.347689]
  [    1.348041] Kernel panic - not syncing: Attempted to kill init! exitcode=0x00000009
  [    1.348649] Press Stop-A (L1-A) from sun keyboard or send break
  [    1.348649] twice on console to return to the boot prom
  [    1.348940] ---[ end Kernel panic - not syncing: Attempted to kill init! exitcode=0x00000009 ]---

This series is not bisectable to see which specific patch causes this,
as I get

  In file included from mm/init-mm.c:2:
  include/linux/mm_types.h:1419:57: error: 'PERCPU_COUNTER_TREE_ITEMS_STATIC_SIZE' undeclared here (not in a function)
   1419 |         [0 ... sizeof(cpumask_t) + MM_CID_STATIC_SIZE + PERCPU_COUNTER_TREE_ITEMS_STATIC_SIZE - 1] = 0  \
        |                                                         ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
  mm/init-mm.c:50:27: note: in expansion of macro 'MM_STRUCT_FLEXIBLE_ARRAY_INIT'
     50 |         .flexible_array = MM_STRUCT_FLEXIBLE_ARRAY_INIT,
        |                           ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~
  include/linux/mm_types.h:1419:10: error: array index in initializer not of integer type
   1419 |         [0 ... sizeof(cpumask_t) + MM_CID_STATIC_SIZE + PERCPU_COUNTER_TREE_ITEMS_STATIC_SIZE - 1] = 0  \
        |          ^
  mm/init-mm.c:50:27: note: in expansion of macro 'MM_STRUCT_FLEXIBLE_ARRAY_INIT'
     50 |         .flexible_array = MM_STRUCT_FLEXIBLE_ARRAY_INIT,
        |                           ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~
  include/linux/mm_types.h:1419:10: note: (near initialization for 'init_mm.flexible_array')
   1419 |         [0 ... sizeof(cpumask_t) + MM_CID_STATIC_SIZE + PERCPU_COUNTER_TREE_ITEMS_STATIC_SIZE - 1] = 0  \
        |          ^
  mm/init-mm.c:50:27: note: in expansion of macro 'MM_STRUCT_FLEXIBLE_ARRAY_INIT'
     50 |         .flexible_array = MM_STRUCT_FLEXIBLE_ARRAY_INIT,
        |                           ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~

prior to this change that removes PERCPU_COUNTER_TREE_ITEMS_STATIC_SIZE.

Cheers,
Nathan
Re: [PATCH v17 0/3] Improve proc RSS accuracy
Posted by Mathieu Desnoyers 1 month ago
On 2026-02-26 20:12, Nathan Chancellor wrote:
> 
> FWIW, the ClangBuiltLinux CI sees a boot failure with sparc64_defconfig,
> which does not appear to be clang specific. I can reproduce it here
> with:

I found the issue with the info provided by Heiko. Thanks for the
reproducer!

> This series is not bisectable to see which specific patch causes this,
> as I get
> 
>    In file included from mm/init-mm.c:2:
>    include/linux/mm_types.h:1419:57: error: 'PERCPU_COUNTER_TREE_ITEMS_STATIC_SIZE' undeclared here (not in a function)
>     1419 |         [0 ... sizeof(cpumask_t) + MM_CID_STATIC_SIZE + PERCPU_COUNTER_TREE_ITEMS_STATIC_SIZE - 1] = 0  \
>          |                                                         ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
>    mm/init-mm.c:50:27: note: in expansion of macro 'MM_STRUCT_FLEXIBLE_ARRAY_INIT'
>       50 |         .flexible_array = MM_STRUCT_FLEXIBLE_ARRAY_INIT,
>          |                           ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~
>    include/linux/mm_types.h:1419:10: error: array index in initializer not of integer type
>     1419 |         [0 ... sizeof(cpumask_t) + MM_CID_STATIC_SIZE + PERCPU_COUNTER_TREE_ITEMS_STATIC_SIZE - 1] = 0  \
>          |          ^
>    mm/init-mm.c:50:27: note: in expansion of macro 'MM_STRUCT_FLEXIBLE_ARRAY_INIT'
>       50 |         .flexible_array = MM_STRUCT_FLEXIBLE_ARRAY_INIT,
>          |                           ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~
>    include/linux/mm_types.h:1419:10: note: (near initialization for 'init_mm.flexible_array')
>     1419 |         [0 ... sizeof(cpumask_t) + MM_CID_STATIC_SIZE + PERCPU_COUNTER_TREE_ITEMS_STATIC_SIZE - 1] = 0  \
>          |          ^
>    mm/init-mm.c:50:27: note: in expansion of macro 'MM_STRUCT_FLEXIBLE_ARRAY_INIT'
>       50 |         .flexible_array = MM_STRUCT_FLEXIBLE_ARRAY_INIT,
>          |                           ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> 
> prior to this change that removes PERCPU_COUNTER_TREE_ITEMS_STATIC_SIZE.

Good catch. I will fix this as well for v18.

Thanks!

Mathieu

-- 
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com
Re: [PATCH v17 0/3] Improve proc RSS accuracy
Posted by Heiko Carstens 1 month ago
On Thu, Feb 26, 2026 at 06:12:01PM -0700, Nathan Chancellor wrote:
> Hi Mathieu,
> 
> On Thu, Feb 26, 2026 at 02:38:04PM -0500, Mathieu Desnoyers wrote:
> > I've successfully booted a defconfig s390x next-20260226 kernel in qemu
> > with 1 and 4 CPUs, and within a nested s390x VM on 2 cpus.
> > 
> > I guess I'll really need more info about your specific .config and
> > command line args to help further.

On s390, the cpumask_set_cpu(0, mm_cpumask(&init_mm)) call in arch_mm_preinit()
writes out of bounds into swap_attrs[], overwriting the terminating NULL.
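
As a side note on the oops reported earlier in this thread: the faulting
register value 766d615f72615f65 consists entirely of printable ASCII bytes,
which would be consistent with code walking past the end of a pointer array
and treating string data as a pointer. A minimal userspace sketch of that
decoding follows; the helper name is hypothetical, and most-significant-byte
first order is assumed because s390 is big-endian.

```c
#include <ctype.h>
#include <stdint.h>

/*
 * Decode a 64-bit register value as eight characters, most significant
 * byte first. Non-printable bytes are shown as '.'.
 */
static void sketch_reg_to_ascii(uint64_t reg, char out[9])
{
    int i;

    for (i = 0; i < 8; i++) {
        unsigned char byte = (unsigned char)((reg >> (56 - 8 * i)) & 0xff);

        out[i] = isprint(byte) ? (char)byte : '.';
    }
    out[8] = '\0';
}
```

Feeding 0x766d615f72615f65 to this helper yields "vma_ra_e", which looks like
the start of an attribute name. That reading is an inference from the register
dump, not something confirmed in this thread.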

This seems to happen because the return value of get_rss_stat_items_size() is
larger than PERCPU_COUNTER_TREE_ITEMS_STATIC_SIZE:

PERCPU_COUNTER_TREE_ITEMS_STATIC_SIZE: 18688
get_rss_stat_items_size(): 21504

Here I stopped looking further into this. I guess you will figure out
immediately what's wrong :)
Re: [PATCH v17 0/3] Improve proc RSS accuracy
Posted by Mathieu Desnoyers 1 month ago
On 2026-02-27 08:11, Heiko Carstens wrote:
> On Thu, Feb 26, 2026 at 06:12:01PM -0700, Nathan Chancellor wrote:
>> Hi Mathieu,
>>
>> On Thu, Feb 26, 2026 at 02:38:04PM -0500, Mathieu Desnoyers wrote:
>>> I've successfully booted a defconfig s390x next-20260226 kernel in qemu
>>> with 1 and 4 CPUs, and within a nested s390x VM on 2 cpus.
>>>
>>> I guess I'll really need more info about your specific .config and
>>> command line args to help further.
> 
> On s390 cpumask_set_cpu(0, mm_cpumask(&init_mm)); in arch_mm_preinit() writes
> out-of-bounds into swap_attrs[] overwriting the terminating NULL.
> 
> This seems to happen because the return value of get_rss_stat_items_size() is
> larger than PERCPU_COUNTER_TREE_ITEMS_STATIC_SIZE:
> 
> PERCPU_COUNTER_TREE_ITEMS_STATIC_SIZE: 18688
> get_rss_stat_items_size(): 21504
> 
> Here I stopped looking further into this. I guess you will figure out
> immediately what's wrong :)

Indeed!

So in get_rss_stat_items_size() we have:

static inline size_t get_rss_stat_items_size(void)
{
         return percpu_counter_tree_items_size() * NR_MM_COUNTERS;
}

And just above:

#define MM_STRUCT_FLEXIBLE_ARRAY_INIT                                                                   \
{                                                                                                       \
         [0 ... PERCPU_COUNTER_TREE_ITEMS_STATIC_SIZE + sizeof(cpumask_t) + MM_CID_STATIC_SIZE - 1] = 0  \
}

Which fails to account for NR_MM_COUNTERS. Does the following fix your issue?

#define MM_STRUCT_FLEXIBLE_ARRAY_INIT                                                                   \
{                                                                                                       \
         [0 ... (PERCPU_COUNTER_TREE_ITEMS_STATIC_SIZE * NR_MM_COUNTERS) + sizeof(cpumask_t) + MM_CID_STATIC_SIZE - 1] = 0  \
}

It would only cause issues when nr_cpu_ids grows closer to NR_CPUS, which explains
why I could not reproduce it locally.
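
The arithmetic behind this can be sketched in plain C using the two values
Heiko reported. This is only an illustration: the SKETCH_/sketch_ names are
hypothetical, NR_MM_COUNTERS is assumed to be 4, and the cpumask and MM_CID
parts of the flexible array are left out because they appear on both sides.

```c
#include <assert.h>
#include <stddef.h>

/* Values reported on the failing s390 system. */
#define SKETCH_ITEMS_STATIC_SIZE 18688u  /* PERCPU_COUNTER_TREE_ITEMS_STATIC_SIZE */
#define SKETCH_RUNTIME_TOTAL     21504u  /* get_rss_stat_items_size() */
#define SKETCH_NR_MM_COUNTERS    4u      /* assumption: number of RSS counters */

/* Bytes reserved for the counters by the original initializer. */
static size_t sketch_reserved_before_fix(void)
{
    return SKETCH_ITEMS_STATIC_SIZE;
}

/* Bytes reserved once the static size is scaled per counter. */
static size_t sketch_reserved_after_fix(void)
{
    return (size_t)SKETCH_ITEMS_STATIC_SIZE * SKETCH_NR_MM_COUNTERS;
}

/* Non-zero when the runtime layout fits in the reserved space. */
static int sketch_fits(size_t reserved, size_t needed)
{
    return needed <= reserved;
}
```

With these numbers the original reservation is 2816 bytes short of the runtime
requirement, while the scaled reservation (74752 bytes) covers it. On systems
where nr_cpu_ids is much smaller than NR_CPUS, the runtime total can stay
below the single static size, so the overflow does not show up.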

Thanks,

Mathieu

-- 
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com
Re: [PATCH v17 0/3] Improve proc RSS accuracy
Posted by Heiko Carstens 1 month ago
On Fri, Feb 27, 2026 at 08:25:46AM -0500, Mathieu Desnoyers wrote:
> On 2026-02-27 08:11, Heiko Carstens wrote:
> > This seems to happen because the return value of get_rss_stat_items_size() is
> > larger than PERCPU_COUNTER_TREE_ITEMS_STATIC_SIZE:
> > 
> > PERCPU_COUNTER_TREE_ITEMS_STATIC_SIZE: 18688
> > get_rss_stat_items_size(): 21504

...

> Which fails to account for NR_MM_COUNTERS. Does the following fix your issue ?
> 
> #define MM_STRUCT_FLEXIBLE_ARRAY_INIT                                                                   \
> {                                                                                                       \
>         [0 ... (PERCPU_COUNTER_TREE_ITEMS_STATIC_SIZE * NR_MM_COUNTERS) + sizeof(cpumask_t) + MM_CID_STATIC_SIZE - 1] = 0  \
> }

Yes, this works.